In observational studies, treatments are typically not randomized
and therefore estimated treatment effects may be subject to
confounding bias. The instrumental variable (IV) design plays the 
role of a quasi-experimental handle since the IV is associated with 
the treatment and only affects the outcome through the treatment. In 
this paper, we present a novel framework for identification and inference
using an IV for the marginal average treatment effect amongst
the treated (ETT) in the presence of unmeasured confounding. For 
inference, we propose three different semiparametric approaches: (i) 
inverse probability weighting (IPW), (ii) outcome regression (OR), 
and (iii) doubly robust (DR) estimation, which is consistent if either 
(i) or (ii) is consistent, but not necessarily both. A closed-form locally 
semiparametric efficient estimator is obtained in the simple case of 
binary IV and outcome and the efficiency bound is derived for the 
more general case. 


1. Introduction. Sociology and epidemiology studies often aim to evaluate
the effect of a treatment. For practical reasons, the average treatment
effect among treated individuals (ETT) is sometimes of greater interest than 
the treatment effect in the population. For example, in epidemiology studies 
concerning the toxic effects of a new drug or in sociology studies evaluating 
the effects of a policy among those whom the policy is applied to, the ETT 
is the parameter of interest. 

In observational or randomized studies with non-compliance, a primary 
challenge is the presence of unmeasured confounding, i.e. outcomes between 
treatment groups may differ not only due to the treatment effect, but also 
because of unmeasured factors that may affect the treatment selection. 
Instrumental variables (IVs) are useful in addressing unmeasured confounding.
An IV is a variable that is associated with the treatment and
affects the outcome only through the treatment. The key idea of the
IV method is to extract exogenous variation in the treatment that is unconfounded
with the outcome and to take advantage of this bias-free component
to make causal inferences about the treatment effect (Robins, 1989; Angrist,
Imbens and Rubin, 1996; Heckman, 1997).

The development of the IV approach can be traced back to Wright (1928) 
and Goldberger (1972) under linear structural equations in econometrics. 
Imbens and Angrist (1994), Angrist, Imbens and Rubin (1996) and Heckman
(1997) formalized the IV approach within the framework of potential
outcomes or counterfactuals. Robins (1989) and Robins (1994) evaluated the
average treatment effect among treated individuals (ETT) conditional on
the IV and observed covariates under additive and multiplicative structural
nested models (SNMs). Identification is achieved by assuming a certain degree
of homogeneity with regard to the IV in an SNM of the conditional ETT
(Hernán and Robins, 2006). Mainly, the assumption states that the magnitude
of the conditional ETT does not vary with the IV. This is also referred
to as the no-current treatment value interaction assumption. Under a similar
identifying assumption, Vansteelandt and Goetghebeur (2003), Robins and
Rotnitzky (2004), Tan (2010), Clarke, Palmer and Windmeijer (2014) and
Matsouaka and Tchetgen Tchetgen (2014) investigated estimation of this
conditional causal effect using additive, multiplicative and logistic SNMs.[1]

The literature mentioned above has some limitations. First of all, the 
literature focuses on the ETT conditional on the IV and observed covariates. 
The identification of such conditional ETT was achieved by specifying a 
functional form of the treatment causal effect. This is unattractive since
it places constraints directly on the main parameter of interest, and
misspecification of this functional form would lead to biased results. Second,
the available inference methods require the treatment propensity score to
be correctly specified even for an outcome regression-based estimator (Tan,
2010).

In this paper, we remedy these limitations in a novel framework for identification
and estimation using an IV of the marginal ETT in the presence
of unmeasured confounding. By targeting directly the marginal ETT, we
allow the conditional causal effect to remain unrestricted. Our methods are
particularly valuable when the primary goal is to obtain an accurate estimate
of the treatment effect. Additionally, we propose a new identification
strategy which is applicable to any type of outcome, and provides necessary
and sufficient global identification conditions. Moreover, for inference,
we propose three different semiparametric estimators allowing for flexible
covariate adjustment: (i) inverse probability weighting (IPW), (ii) outcome
regression (OR) and (iii) doubly robust (DR) estimation, which is consistent
if either (i) or (ii) is consistent but not necessarily both.

[1] In another line of research, Imbens and Angrist (1994) and Angrist, Imbens and Rubin
(1996) defined the treatment effect on individuals who would comply with their assigned
treatment. Under a monotonicity assumption about the effect of the IV on exposure, the
complier average treatment effect can be identified. Further research along these lines includes
fully parametric estimation strategies (Tan, 2006; Barnard et al., 2003; Frangakis
et al., 2004) as well as semiparametric methods (Abadie, 2003; Abadie, Angrist and Imbens,
2002; Tan, 2006; Ogburn, Rotnitzky and Robins, 2014).

The outline for the paper is as follows. In Section 2, we introduce the notation
and state the main assumptions. We study the nonparametric identification
of ETT in Section 3. We introduce IPW, OR as well as DR estimators
in Section 4. In Section 5, we assess the performance of various estimators
in a simulation study. In Section 6, we further illustrate the methods with
a study concerning the impact of participation in a 401(k) retirement program
on savings. We conclude with a brief discussion in Section 7.

2. Preliminary Results. Suppose that one observes independently
and identically distributed data $O = (A, Y, Z, C)$, where $A$ is a binary treatment,
$Y$ is the outcome of interest and $(Z, C)$ are pre-exposure variables.
Let $a, y, z, c$ denote the possible values that $A, Y, Z, C$ could take. Let $Y_{az}$
denote the potential outcome if $A$ and $Z$ are set to $a$ and $z$, and let $Y_a$ denote
the potential outcome if only $A$ is set to $a$. We formalize the IV assumptions
using potential outcomes:

(IV.1) Stochastic exclusion restriction:

$$Y_{az} = Y_a \text{ almost surely for all } a \text{ and } z;$$

(IV.2) Unconfounded IV-outcome relation:

$$f_{Y_0 \mid Z, C}(y \mid z, c) = f_{Y_0 \mid C}(y \mid c) \text{ for all } z \text{ and } c;$$

(IV.3) IV relevance:

$$\Pr(A = 1 \mid Z = z, C = c) \neq \Pr(A = 1 \mid Z = 0, C = c) \text{ for all } z \neq 0 \text{ and } c.$$

Assumption (IV.1) states that $Z$ does not have a direct effect on the
outcome $Y$; thus we use $Y_a$ to denote the potential outcome under treatment
$a$ for $a = 0, 1$. Assumption (IV.2) is ensured under physical randomization
but will hold more generally if $C$ includes all common causes of $Z$ and
$Y$. Assumptions (IV.1)-(IV.2) together imply that conditional on $C$, the
IV is independent of the potential outcome for the unexposed, i.e., $Y_0 \perp\!\!\!\perp Z \mid C$.
Assumption (IV.3) states that $A$ and $Z$ have a non-null association
conditional on $C$, even if the association is not causal. If assumptions (IV.1)-(IV.3)
are satisfied, $Z$ is said to be a valid IV.

We make the consistency assumption $Y = A Y_1 + (1 - A) Y_0$ almost surely.
The marginal treatment effect on the treated is $\mathrm{ETT} = E(Y_1 - Y_0 \mid A = 1)$.
Because $E(Y_1 \mid A = 1) = E(Y \mid A = 1)$ can be consistently estimated from the
average observed outcome of treated individuals, throughout, we focus on
making inferences about $\psi$, where

$$\psi = E(Y_0 \mid A = 1).$$

Suppose there exist unmeasured variables denoted by $U$ such that controlling
for $(U, Z, C)$ suffices to account for confounding, i.e. $Y_0 \perp\!\!\!\perp A \mid (U, Z, C)$;
however,

(2.1) $$Y_0 \not\perp\!\!\!\perp A \mid (Z, C),$$

where $\perp\!\!\!\perp$ denotes statistical independence. As pointed out by Robins, Rotnitzky
and Scharfstein (2000), potential outcomes can be viewed as the ultimate
unmeasured confounders. This is because by the consistency assumption,
the observed outcome $Y$ is a deterministic function of the treatment
and the potential outcomes. Thus, given $(Y_0, Y_1)$, $U$ does not contain any
further information about $Y$. To make explicit use of (2.1), we define the
extended propensity score $\pi(Y_0, Z, C) = \Pr(A = 1 \mid Y_0, Z, C)$ as a function of
$Y_0$.

3. Nonparametric Identification. While assumptions (IV.1)-(IV.3)
suffice to obtain a valid test of the sharp null hypothesis of no treatment
effect (Robins, 1994) and can also be used to test for the presence of confounding
bias (Pearl, 1995), ETT is not uniquely determined by the observed
data without any additional restriction. For simplicity, we first consider the
situation where covariates are omitted and outcome and IV are both binary.
From the observed data, one can identify the quantities $\Pr(Y_0, Z \mid A = 0)$,
$\Pr(Z \mid A = 1)$ and $\Pr(A = 0)$. These quantities are functions of the unknown
parameters $\Pr(Z = 1)$, $\Pr(Y_0 = 1)$ and $\Pr(A = 0 \mid Y_0, Z)$. Without imposing
any additional assumption, there are six unknown parameters (one for
$\Pr(Z = 1)$, one for $\Pr(Y_0 = 1)$ and four for $\Pr(A = 0 \mid Y_0, Z)$); however,
only five degrees of freedom are available from the observed data (one for
$\Pr(A = 0)$, one for $\Pr(Z \mid A = 1)$ and three for $\Pr(Y, Z \mid A = 0)$). As a result,
the joint distribution $f(A, Y_0, Z)$ is not uniquely identified. In particular, $\psi$
is not identified.
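
The counting argument above can be illustrated numerically. The following is a minimal sketch under assumed, hypothetical parameter values (none of them come from the paper): starting from one saturated treatment mechanism, it builds a second candidate with a different $\Pr(Y_0 = 1)$ that reproduces exactly the same observed-data law but implies a different $\psi$.

```python
import numpy as np

def observed_law(p_z, p_y, pi0):
    """Observed-data law implied by (p_z, p_y, pi0) when Y0 is independent of Z.

    pi0[y, z] = Pr(A = 0 | Y0 = y, Z = z). Returns
    untreated[y, z] = Pr(Y = y, Z = z, A = 0) and treated[z] = Pr(Z = z, A = 1).
    """
    py = np.array([1.0 - p_y, p_y])
    pz = np.array([1.0 - p_z, p_z])
    joint = np.outer(py, pz)                       # Pr(Y0 = y, Z = z)
    untreated = joint * pi0
    treated = (joint * (1.0 - pi0)).sum(axis=0)    # Y0 is unobserved when A = 1
    return untreated, treated

def psi(p_y, untreated):
    """psi = Pr(Y0 = 1 | A = 1) = {Pr(Y0 = 1) - Pr(Y0 = 1, A = 0)} / Pr(A = 1)."""
    return (p_y - untreated[1].sum()) / (1.0 - untreated.sum())

# Hypothetical "true" saturated mechanism (illustrative values only).
p_z, p_y = 0.5, 0.4
pi0 = np.array([[0.7, 0.5],
                [0.6, 0.3]])
untreated, treated = observed_law(p_z, p_y, pi0)

# A different candidate: change Pr(Y0 = 1) and rescale pi0 to compensate.
p_y_alt = 0.5
m = untreated / np.array([1.0 - p_z, p_z])         # m[y, z] = Pr(Y0 = y, A = 0 | Z = z)
pi0_alt = m / np.array([[1.0 - p_y_alt], [p_y_alt]])
untreated_alt, treated_alt = observed_law(p_z, p_y_alt, pi0_alt)

psi_true = psi(p_y, untreated)
psi_alt = psi(p_y_alt, untreated_alt)
print(np.allclose(untreated, untreated_alt), np.allclose(treated, treated_alt))
print(round(psi_true, 3), round(psi_alt, 3))
```

With these values the two candidates agree on every observable probability, while the implied values of $\psi$ are roughly 0.478 and 0.696, so the observed data alone cannot distinguish them.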
For identification purposes, additional assumptions, such as Robins' no-current
treatment value interaction assumption (Hernán and Robins, 2006),
must be imposed to reduce the set of candidate models for the joint distribution
$f(A, Y_0, Z, C)$. Below, we give a general necessary and sufficient
condition for identification. Let $\mathcal{P}_{A \mid Y_0, Z, C}$ and $\mathcal{P}_{Y_0 \mid C}$ denote the collections
of candidates for $\Pr(A = 0 \mid Y_0, Z, C)$ and $f(Y_0 \mid C)$, which are known to satisfy
(IV.1) and (IV.2).

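Condition 1. Any two distinct elements $\Pr_1(A = 0 \mid Y_0, Z, C)$, $\Pr_2(A = 0 \mid Y_0, Z, C) \in \mathcal{P}_{A \mid Y_0, Z, C}$
and $f_1(Y_0 \mid C)$, $f_2(Y_0 \mid C) \in \mathcal{P}_{Y_0 \mid C}$ satisfy the inequality:

$$\frac{\Pr_1(A = 0 \mid Y_0, Z, C)}{\Pr_2(A = 0 \mid Y_0, Z, C)} \neq \frac{f_2(Y_0 \mid C)}{f_1(Y_0 \mid C)}.$$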

The following proposition states that condition 1 is a necessary and sufficient
condition for identifiability of the joint distribution of $(A, Y_0, Z, C)$,
where $Y_0$ and $Z$ may be dichotomous, polytomous, discrete or continuous.

Proposition 1. The joint distribution of $(A, Y_0, Z, C)$ is identified in
the model defined by $\mathcal{P}_{A \mid Y_0, Z, C}$ and $\mathcal{P}_{Y_0 \mid C}$ if and only if condition 1 holds.

It is convenient to check condition 1 for parametric models, but it may be
harder for semiparametric and nonparametric models, since $\mathcal{P}_{A \mid Y_0, Z, C}$ and
$\mathcal{P}_{Y_0 \mid C}$ can be complicated. The following corollary gives a more convenient
condition.

Corollary 1. Suppose that for any two candidates $\Pr_1(A = 0 \mid Y_0, Z, C)$,
$\Pr_2(A = 0 \mid Y_0, Z, C) \in \mathcal{P}_{A \mid Y_0, Z, C}$, the ratio $\Pr_1(A = 0 \mid Y_0, Z, C) / \Pr_2(A = 0 \mid Y_0, Z, C)$
is either a constant or varies with $Z$. Then the joint distribution
of $(A, Y_0, Z, C)$ is identified.

Although the condition provided in Corollary 1 is a sufficient condition for
identification, it allows identification of a large class of models. We further
illustrate Proposition 1 and Corollary 1 with several examples. For simplicity,
we again omit covariates; however, we show at the end of this section
that similar results with covariates can be derived. We first consider the case
of binary outcome with binary IV.

Example 1. Consider a model $\mathcal{P}_{A \mid Y_0, Z} = \{\Pr(A = 0 \mid Y_0, Z) : \operatorname{logit} \Pr(A = 0 \mid Y_0, Z; \theta_1, \theta_2, \eta_1, \eta_2) = \theta_1 + \theta_2 Z + \eta_1 Y_0 + \eta_2 Y_0 Z;\ \theta_1, \theta_2, \eta_1, \eta_2 \in (-\infty, \infty)\}$.
The model is saturated since $\mathcal{P}_{A \mid Y_0, Z}$ contains all possible treatment mechanisms.
It can be shown that neither the joint distribution nor $\psi$ is identified
even under the assumptions (IV.1)-(IV.3).

Example 1 shows that the joint density $f(A, Y_0, Z)$ is not identified when
the treatment selection mechanism is left unrestricted under (IV.1)-(IV.3).
However, we show that the joint density $f(A, Y_0, Z)$ is identified assuming
a separable treatment mechanism on the additive scale.

Example 2. Consider a model $\mathcal{P}_{A \mid Y_0, Z} = \{\Pr(A = 0 \mid Y_0, Z) : \operatorname{logit} \Pr(A = 0 \mid Y_0, Z; \theta_1, \theta_2, \eta_1) = \theta_1 + \theta_2 Z + \eta_1 Y_0;\ \theta_1, \theta_2, \eta_1 \in (-\infty, \infty)\}$. The model is
separable since $\mathcal{P}_{A \mid Y_0, Z}$ excludes an interaction between $Y_0$ and $Z$. It can be
shown that both the joint distribution and $\psi$ are identified under assumptions
(IV.1)-(IV.3).

Example 2 agrees with the intuition that identification follows from having
fewer parameters than the saturated model. Under the assumed model,
we have five unknown parameters and five available degrees of freedom from
the empirical distribution. We show in the next example that the joint distribution
and $\psi$ can be identified in a general separable model when the
outcome and instrument are both continuous.
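
To make the identification claim of Example 2 concrete, the sketch below recovers $\Pr(Y_0 = 1)$ from identified quantities only. It assumes hypothetical parameter values and uses the fact that, for a candidate $p = \Pr(Y_0 = 1)$, the implied $\Pr(A = 0 \mid Y_0 = y, Z = z)$ equals $\Pr(Y = y, A = 0 \mid Z = z) / \Pr(Y_0 = y)$; separability then requires the $Z$ log odds ratio of $A = 0$ to be the same in both $Y_0$ strata. The root-finding approach is our illustration, not the proof given in the Supplementary Materials.

```python
import numpy as np
from scipy.optimize import brentq

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical truth: logit Pr(A = 0 | Y0, Z) = theta1 + theta2 * Z + eta1 * Y0.
p_y_true = 0.4
theta1, theta2, eta1 = 0.3, 1.0, -0.8

# Identified from the data: m[y, z] = Pr(Y = y, A = 0 | Z = z)
#                                   = Pr(Y0 = y) * Pr(A = 0 | Y0 = y, Z = z).
py = np.array([1.0 - p_y_true, p_y_true])
m = np.array([[py[y] * expit(theta1 + theta2 * z + eta1 * y) for z in (0, 1)]
              for y in (0, 1)])

def log_odds_gap(p):
    """Gap between the Z log odds ratios of A = 0 in the Y0 = 0 and Y0 = 1 strata.

    The gap is zero exactly when the candidate p = Pr(Y0 = 1) is compatible
    with a mechanism that has no Y0-Z interaction on the logit scale.
    """
    q = 1.0 - p
    lor_y0 = np.log(m[0, 1] / (q - m[0, 1])) - np.log(m[0, 0] / (q - m[0, 0]))
    lor_y1 = np.log(m[1, 1] / (p - m[1, 1])) - np.log(m[1, 0] / (p - m[1, 0]))
    return lor_y0 - lor_y1

# Candidates must keep every implied probability strictly inside (0, 1).
lower = m[1].max() + 1e-9
upper = 1.0 - m[0].max() - 1e-9
p_y_hat = brentq(log_odds_gap, lower, upper)
print(p_y_hat)
```

Once $\Pr(Y_0 = 1)$ is pinned down, the remaining parameters follow from the implied odds, in contrast to the saturated model of Example 1 where every admissible candidate fits the observed law.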

Example 3. Consider the logistic separable treatment mechanism: $\mathcal{P}_{A \mid Y_0, Z} = \{\Pr(A = 0 \mid Y_0, Z) : \operatorname{logit} \Pr(A = 0 \mid Y_0, Z) = q(Z) + h(Y_0)\}$, where $q$ and
$h$ are unknown differentiable functions with $h(0) = 0$. It can be shown that
$\mathcal{P}_{A \mid Y_0, Z}$ satisfies condition 1 and thus the joint distribution is identified under
(IV.1)-(IV.3).

These results can be generalized to include covariates $C$. For instance, by
allowing both $q$ and $h$ to depend on $C$ in example 3:

$$\mathcal{P}_{A \mid Y_0, Z, C} = \{\Pr(A = 0 \mid Y_0, Z, C) : \operatorname{logit} \Pr(A = 0 \mid Y_0, Z, C) = q(Z, C) + h(Y_0, C)\},$$

where $h(0, C) = 0$, the joint distribution is identified whenever the interaction
term of $Y_0$ and $Z$ is absent.

In the Supplementary Materials, we present proofs for the above examples, 
and additional examples, such as the case of continuous outcome with binary 
IV, and a separable treatment mechanism. 

4. Estimation. While nonparametric identification conditions are provided
in Section 3, such conditions will seldom suffice for reliable statistical
inference. Typically in observational studies, the set of covariates $C$ is
too large for nonparametric inference, due to the curse of dimensionality
(Robins and Ritov, 1997). To make progress, we posit parametric models
for various nuisance parameters, and provide three possible approaches for
semiparametric inference that depend on different subsets of models. We
describe an IPW, an OR and a DR estimator of the marginal ETT under
assumptions (IV.1)-(IV.2) and condition 1. Throughout, we posit a parametric
model $f_{Z \mid C}(z \mid c; \rho) = \Pr(Z = z \mid C = c; \rho)$ for the conditional density
of $Z$ given $C$. Let $\hat{\rho}$ denote the maximum likelihood estimator (MLE) of $\rho$.
Let $\mathbb{P}_n$ denote the empirical measure, that is $\mathbb{P}_n f(O) = n^{-1} \sum_{i=1}^{n} f(O_i)$. Let
$\hat{E}$ denote the expectation taken under the empirical distribution of $C$ and
let $\widehat{\Pr}(A = 1) = \mathbb{P}_n A$ denote the empirical probability of receiving
treatment.

4.1. IPW estimator. For estimation, we first propose an IPW IV approach
which extends standard IPW estimation of ETT to an IV setting.
We make the positivity assumption that for all values of $Y_0$, $Z$ and $C$ the
probability of being unexposed to treatment is bounded away from 0. The
IPW approach relies on the crucial assumption that the extended propensity
score model $\pi(Y_0, Z, C; \gamma)$ is correctly specified with unknown finite-dimensional
parameter $\gamma$ and the following representation of ETT,

(4.1) $$E(Y_0 \mid A = 1) = E\left[ \frac{\pi(Y_0, Z, C)\, Y (1 - A)}{\Pr(A = 1)\{1 - \pi(Y_0, Z, C)\}} \right].$$

A derivation of the above equation is given in the Supplementary Materials.
We solve the following equations to obtain an estimator $\hat{\gamma}$ of $\gamma$:

(4.2) $$\mathbb{P}_n\left\{ \frac{1 - A}{1 - \pi(Y_0, Z, C; \gamma)} - 1 \right\} = 0,$$

(4.3) $$\mathbb{P}_n\left[ \frac{1 - A}{1 - \pi(Y_0, Z, C; \gamma)} \{h_1(Z, C) - E(h_1(Z, C) \mid C; \hat{\rho})\} \right] = 0,$$

(4.4) $$\mathbb{P}_n\left[ \frac{1 - A}{1 - \pi(Y_0, Z, C; \gamma)} \{h_2(C) - \hat{E}(h_2(C))\} \right] = 0,$$

(4.5) $$\mathbb{P}_n\left[ \frac{1 - A}{1 - \pi(Y_0, Z, C; \gamma)}\, t(Y, C) \{l(Z, C) - E(l(Z, C) \mid C; \hat{\rho})\} \right] = 0,$$

where $(h_1^T, h_2^T, t^T, l^T)^T$ satisfies the regularity condition (A.1) described in the
Supplementary Materials. Equations (4.3) and (4.4) identify the association
between $(Z, C)$ and $A$ in $\pi(0, Z, C)$. By leveraging the IV property (IV.1)-(IV.2),
equation (4.5) identifies the degree of selection bias encoded in the
dependence of $\pi$ on $Y_0$. By equation (4.1), an extended propensity score
estimator leads to an estimator of $\psi$. We have the following result:
Proposition 2. Under (IV.1)-(IV.2) and condition 1, suppose the extended
propensity score model $\pi(Y_0, Z, C; \gamma)$ and $f_{Z \mid C}(z \mid c; \rho)$ are correctly
specified; then the IPW estimator

$$\hat{\psi}^{ipw} = \mathbb{P}_n \frac{\pi(Y_0, Z, C; \hat{\gamma})\, Y (1 - A)}{\widehat{\Pr}(A = 1)\{1 - \pi(Y_0, Z, C; \hat{\gamma})\}}$$

is consistent for $\psi$.

We emphasize that the extended propensity score model can use any well-defined
link function (e.g., logit, probit), and if condition 1 holds, Proposition
2 still holds. The functions $h_1$, $h_2$, $t$ and $l$ can be chosen based
on the model for the extended propensity score. For example, assume
$\operatorname{logit} \pi(Y_0, Z, C; \gamma) = \theta_0 + \theta_1 Z + \theta_2 C + \eta Y_0$, where $\bar{\gamma} = (\theta_1, \theta_2, \eta)^T$ is a $k$-dimensional
parameter vector. The $k$-dimensional function $(h_1, h_2, t)^T$ can
be chosen as $(h_1, h_2, t)^T = \partial \operatorname{logit} \pi(Y_0, Z, C; \gamma) / \partial \bar{\gamma} = (Z, C, Y_0)^T$ and $l$ can
be chosen as any scalar function of $(Z, C)$, e.g., $l(Z, C) = Z$. Thus we have
exactly $k + 1$ estimating equations. The choice of $h_1$, $h_2$, $t$ and $l$ will generally
impact efficiency but should not affect consistency as long as the identification
conditions hold and the required models are correctly specified.
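
The IPW procedure can be sketched in a covariate-free setting, where equation (4.4) drops out and (4.2), (4.3) and (4.5) give three equations for $(\theta_0, \theta_1, \eta)$. Everything below is an illustrative assumption rather than the paper's implementation: the data-generating values are hypothetical, $E\{l(Z)\}$ is replaced by the sample mean of $Z$, and `scipy.optimize.root` stands in for the R package used by the authors.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
n = 200_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical data-generating values (not from the paper); no covariates C.
gamma_true = np.array([-0.5, 1.5, -0.6])            # (theta0, theta1, eta)
Z = rng.binomial(1, 0.5, n)
Y0 = rng.binomial(1, 0.4, n)                        # independent of Z, as (IV.1)-(IV.2) imply
Y1 = rng.binomial(1, 0.5, n)
A = rng.binomial(1, expit(gamma_true[0] + gamma_true[1] * Z + gamma_true[2] * Y0))
Y = A * Y1 + (1 - A) * Y0                           # consistency assumption

def estimating_equations(gamma):
    lin = gamma[0] + gamma[1] * Z + gamma[2] * Y
    w = (1 - A) * (1.0 + np.exp(lin))               # (1 - A) / {1 - pi}; uses Y = Y0 when A = 0
    z_centered = Z - Z.mean()                       # h1(Z) = l(Z) = Z, centered at its mean
    return np.array([
        np.mean(w - 1.0),                           # (4.2)
        np.mean(w * z_centered),                    # (4.3)
        np.mean(w * Y * z_centered),                # (4.5) with t(Y) = Y
    ])

def psi_ipw(gamma):
    """Plug-in version of (4.1): P_n[pi Y (1 - A) / {Pr(A = 1)(1 - pi)}]."""
    lin = gamma[0] + gamma[1] * Z + gamma[2] * Y
    return np.mean((1 - A) * np.exp(lin) * Y) / A.mean()

start = np.array([np.log(A.mean() / (1.0 - A.mean())), 0.0, 0.0])
solution = root(estimating_equations, start)
gamma_hat = solution.x
psi_hat = psi_ipw(gamma_hat)
psi_benchmark = Y0[A == 1].mean()                   # known only because this is a simulation
print(solution.success, gamma_hat, psi_hat, psi_benchmark)
```

The starting value and solver are arbitrary choices; in practice convergence should be checked through `solution.success` and the size of the residuals, since weak instruments can make the system nearly singular.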

4.2. OR and DR estimators. Since $Y_0$ is never observed for the treated
group, we parameterize $E[Y_0 \mid A = 1, Z, C]$ into two parts: one can be estimated
directly using restricted MLE and the other can be computed by
solving an estimating equation. Specifically, we have

(4.6) $$E\{g(Y_0, C) \mid A = 1, Z, C\} = \frac{E[\exp\{\alpha(Y, Z, C)\}\, g(Y, C) \mid A = 0, Z, C]}{E[\exp\{\alpha(Y, Z, C)\} \mid A = 0, Z, C]},$$

where $g$ is any function of $Y_0$ and $C$, and $\alpha(Y_0, Z, C)$ is the generalized odds
ratio function relating $A$ and $Y_0$ conditional on $Z$ and $C$, defined as

$$\alpha(Y_0, Z, C) = \log \frac{f(Y_0 \mid A = 1, Z, C)\, f(Y_0 = 0 \mid A = 0, Z, C)}{f(Y_0 \mid A = 0, Z, C)\, f(Y_0 = 0 \mid A = 1, Z, C)}.$$

Since the association between $Y_0$ and $A$ is attributed to unmeasured confounding,
$\alpha(Y_0, Z, C)$ can be interpreted as the selection bias function. Thus,
we express the conditional mean function $E\{g(Y_0, C) \mid A = 1, Z, C\}$ in terms
of $f(Y \mid A = 0, Z, C)$ and $\alpha(Y_0, Z, C)$. We prove equation (4.6) in the
Supplementary Materials.
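
Equation (4.6) says that the law of $Y_0$ among the treated is an exponentially tilted version of its law among the untreated. The sketch below checks this by Monte Carlo for a continuous $Y_0$, under assumed hypothetical values and a logistic treatment mechanism without a $Y_0$-$Z$ interaction, for which $\alpha(Y_0, Z) = \eta Y_0$; the true $\eta$ is treated as known here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical values; alpha(Y0, Z) = eta * Y0 is the log odds ratio between A and Y0.
eta = 0.3
Z = rng.binomial(1, 0.5, n)
Y0 = rng.normal(1.0, 1.0, n)
A = rng.binomial(1, expit(-0.5 + 1.0 * Z + eta * Y0))

def tilted_mean(y_untreated, eta):
    """Right-hand side of (4.6) with g(Y0) = Y0, using the empirical law of Y among A = 0."""
    w = np.exp(eta * y_untreated)
    return np.sum(w * y_untreated) / np.sum(w)

checks = {}
for z in (0, 1):
    rhs = tilted_mean(Y0[(A == 0) & (Z == z)], eta)
    lhs = Y0[(A == 1) & (Z == z)].mean()            # observable only in a simulation
    checks[z] = (lhs, rhs)
    print(z, round(lhs, 3), round(rhs, 3))
```

Within each level of $Z$ the two columns should agree up to Monte Carlo error, which is what allows $E(Y_0 \mid A = 1, Z, C)$ to be computed from untreated data once $\alpha$ is known or estimated.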

Let $f(y \mid A = 0, Z, C; \xi)$ denote a model for the density of the outcome
among the unexposed conditional on $Z$ and $C$, and let $\hat{\xi}$ denote the restricted
MLE of $\xi$ obtained using only data among the unexposed. Let $\eta$ denote the
parameter indexing a parametric model for the selection bias function $\alpha$ as
$\alpha(Y_0, Z, C; \eta)$. We obtain an estimator $\hat{\eta}$ of $\eta$ by solving:

(4.7) $$\mathbb{P}_n\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C; \hat{\rho})\} \{ A\, E[g(Y_0, C) \mid A = 1, Z, C; \eta, \hat{\xi}] + (1 - A)\, g(Y, C) \} \right] = 0,$$

for any choice of functions $\omega$ and $g$ such that the regularity condition (A.2)
stated in the Supplementary Materials holds. Intuitively, the left-hand side
of equation (4.7) is an empirical estimator of the expected conditional covariance
between $g(Y_0, C)$ and $\omega(Z, C)$ given $C$, which should be zero by
(IV.1)-(IV.2). Based on equation (4.6), we can construct an estimator for $\psi$
based on $\hat{\eta}$, $\hat{\xi}$ and $\hat{\rho}$.


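Proposition 3. Under (IV.1)-(IV.2) and condition 1, suppose $\alpha(Y_0, Z, C; \eta)$,
$f_{Z \mid C}(z \mid c; \rho)$ and $f(Y \mid A = 0, Z, C; \xi)$ are correctly specified; then the OR estimator

$$\hat{\psi}^{reg} = \mathbb{P}_n \left\{ \frac{A}{\widehat{\Pr}(A = 1)} \cdot \frac{E[\exp\{\alpha(Y, Z, C; \hat{\eta})\}\, Y \mid A = 0, Z, C; \hat{\xi}]}{E[\exp\{\alpha(Y, Z, C; \hat{\eta})\} \mid A = 0, Z, C; \hat{\xi}]} \right\}$$

is consistent for $\psi$.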


Functions $g$ and $\omega$ in equation (4.7) can be chosen based on the model we
posit for $\alpha(Y_0, Z, C)$. For example, assuming

(4.8) $$\alpha(Y_0, Z, C; \eta) = \eta Y_0,$$

$g$ can be chosen as $g(Y_0, C) = \partial \alpha(Y_0, Z, C; \eta) / \partial \eta = Y_0$ and $\omega$ can be chosen
as any scalar function of $(Z, C)$, e.g., $\omega(Z, C) = Z$. The choice of $g$ and $\omega$ may
impact efficiency but does not affect consistency as long as the identification
conditions hold and the required models are correctly specified.
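
As a rough illustration of the OR approach, consider binary $Y$, no covariates and the selection bias model (4.8). By (4.6), $E(Y_0 \mid A = 1, Z) = \operatorname{expit}\{\delta(Z) + \eta\}$ with $\delta(Z) = \operatorname{logit} \Pr(Y = 1 \mid A = 0, Z)$, so (4.7) with $\omega(Z) = Z$ and $g(Y) = Y$ is a single equation in $\eta$. The data-generating values, the bracketing interval and the use of `scipy.optimize.brentq` are assumptions made for this sketch only.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
n = 200_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

# Hypothetical data-generating values; the true selection bias parameter is eta = -0.6.
Z = rng.binomial(1, 0.5, n)
Y0 = rng.binomial(1, 0.4, n)
Y1 = rng.binomial(1, 0.5, n)
A = rng.binomial(1, expit(-0.5 + 1.5 * Z - 0.6 * Y0))
Y = A * Y1 + (1 - A) * Y0

# Outcome model among the untreated, saturated in binary Z:
# delta[z] = logit Pr(Y = 1 | A = 0, Z = z), fitted on untreated units only.
delta = np.array([logit(Y[(A == 0) & (Z == z)].mean()) for z in (0, 1)])

def estimating_equation(eta):
    """(4.7) with w(Z) = Z and g(Y) = Y; treated units get the imputed mean of Y0."""
    imputed = A * expit(delta[Z] + eta) + (1 - A) * Y
    return np.mean((Z - Z.mean()) * imputed)

eta_hat = brentq(estimating_equation, -10.0, 10.0)   # bracket chosen for this example
psi_reg = np.mean(A * expit(delta[Z] + eta_hat)) / A.mean()
psi_benchmark = Y0[A == 1].mean()                    # known only because this is a simulation
print(eta_hat, psi_reg, psi_benchmark)
```

Note that no model for the treatment mechanism is fitted here; only the outcome law among the untreated and the selection bias function enter the estimator.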

Tan (2010) proposed an OR estimator for the conditional ETT, which re¬ 
quires correctly specified models for both the treatment propensity score and 
the outcome regression function. In contrast, we circumvent the dependence 
of the regression estimator on the propensity score. 

Note that the proposed estimator for the nuisance parameter $\eta$ is closely related
to the regression estimator proposed by Vansteelandt and Goetghebeur
(2003) when $Y$ is binary. Vansteelandt and Goetghebeur (2003) developed
a two-stage logistic estimator which combines a logistic SMM at the
first stage and a logistic regression association model at the second stage.
Specifically, Vansteelandt and Goetghebeur (2003) focused on estimating
$\zeta(Z, C) = \operatorname{logit} \Pr(Y_1 = 1 \mid A = 1, Z, C) - \operatorname{logit} \Pr(Y_0 = 1 \mid A = 1, Z, C)$,
which encodes the conditional ETT given $Z$ and $C$. Let $\nu$ denote the parameter
indexing a model for $\zeta(Z, C)$ as $\zeta(Z, C; \nu)$. They proposed to estimate
$\nu$ in the estimating equation

(4.9) $$\mathbb{P}_n\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C; \hat{\rho})\} \{ A \operatorname{expit}\{\vartheta(Z, C; \hat{\varrho}) - \zeta(Z, C; \nu)\} + (1 - A) Y \} \right] = 0,$$

where $\operatorname{expit}(x) = \exp(x) / \{1 + \exp(x)\}$ and $\vartheta(Z, C; \varrho) = \operatorname{logit} \Pr(Y = 1 \mid A = 1, Z, C; \varrho)$.

Recall that we obtain an estimator of $\eta$ indexing $\alpha(Y_0, Z, C; \eta)$ in
equation (4.7), which can be re-expressed as

(4.10) $$\mathbb{P}_n\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C; \hat{\rho})\} \{ A \operatorname{expit}\{\delta(Z, C; \hat{\xi}) + \alpha(1, Z, C; \eta)\} + (1 - A) Y \} \right] = 0,$$

where $\delta(Z, C; \xi) = \operatorname{logit} \Pr(Y_0 = 1 \mid A = 0, Z, C)$. Equations (4.9) and (4.10)
mainly differ in the way $\Pr(Y_0 = 1 \mid A = 1, Z, C)$ is estimated. More specifically,
(4.9) obtains $\Pr(Y_0 = 1 \mid A = 1, Z, C)$ using $\Pr(Y_1 = 1 \mid A = 1, Z, C)$
as a baseline risk for the model, while (4.10) uses $\Pr(Y_0 = 1 \mid A = 0, Z, C)$
as baseline risk. This difference is important since Vansteelandt and Goetghebeur
(2003) failed to obtain a DR estimator of $\zeta(Z, C)$ while, as we show
next, our choice of parameterization yields a DR estimator of the marginal
ETT.

Heretofore, we have constructed estimators in two different approaches.
Both approaches assume correct models for $\alpha(Y_0, Z, C; \eta)$ and $f_{Z \mid C}(z \mid c; \rho)$.
The IPW approach further relies on a consistent estimator of the baseline extended
propensity score $\beta(Z, C) = \operatorname{logit} \Pr(A = 1 \mid Y_0 = 0, Z, C)$, which under
the logit link and together with $\alpha(Y_0, Z, C; \eta)$, provides a consistent estimator
of the extended propensity score $\pi(Y_0, Z, C; \gamma) = \operatorname{expit}\{\alpha(Y_0, Z, C; \eta) + \beta(Z, C; \theta)\}$.
The OR approach further relies on a consistent estimator of
$f(Y \mid A = 0, Z, C)$, which together with $\alpha(Y_0, Z, C; \eta)$, provides a consistent
estimator of $\Pr(Y_0 = 1 \mid A = 1, Z, C)$ by (4.6). Define $\mathcal{M}_a$ as the
collection of laws with parametric models $f_{Z \mid C}(z \mid c; \rho)$, $\alpha(Y_0, Z, C; \eta)$ and
$\beta(Z, C; \theta)$ while $f(Y \mid A = 0, Z, C)$ is unrestricted. Likewise, define $\mathcal{M}_y$ as
the collection of laws with parametric models $f_{Z \mid C}(z \mid c; \rho)$, $\alpha(Y_0, Z, C; \eta)$
and $f(Y \mid A = 0, Z, C; \xi)$ while $\beta(Z, C)$ is unrestricted. The main appeal
of a doubly robust estimator is that it remains consistent if either $\beta(Z, C; \theta)$
or $f(Y \mid A = 0, Z, C; \xi)$ is correctly specified. To derive a DR estimator for $\psi$
in the union space $\mathcal{M}_a \cup \mathcal{M}_y$, we first propose a DR estimator for the parameter
$\eta$ of the selection bias model $\alpha(Y_0, Z, C; \eta)$. For notational convenience,
let

(4.11) $$Q_g(Y, A, Z, C; \gamma, \xi) = \frac{(1 - A)\, \pi(Y, Z, C; \gamma)}{1 - \pi(Y, Z, C; \gamma)} \left[ g(Y, C) - \frac{E[\exp\{\alpha(Y, Z, C; \eta)\}\, g(Y, C) \mid A = 0, Z, C; \xi]}{E[\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C; \xi]} \right] + A\, \frac{E[\exp\{\alpha(Y, Z, C; \eta)\}\, g(Y, C) \mid A = 0, Z, C; \xi]}{E[\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C; \xi]}.$$


Equation (4.11) is key to obtaining a DR estimator of the selection bias
function and thus of ETT. Specifically, consider the estimating equation for
the selection bias parameter $\eta$:

(4.12) $$\mathbb{P}_n\left[ \{\omega(Z, C) - E\{\omega(Z, C) \mid C; \hat{\rho}\}\}\, \tilde{Q}_g(Y, A, Z, C; \gamma, \hat{\xi}) \right] = 0,$$

where

$$\tilde{Q}_g(Y, A, Z, C; \gamma, \xi) = Q_g(Y, A, Z, C; \gamma, \xi) + (1 - A)\, g(Y, C) = \frac{(1 - A)\, g(Y, C)}{1 - \pi(Y, Z, C; \gamma)} + \frac{A - \pi(Y, Z, C; \gamma)}{1 - \pi(Y, Z, C; \gamma)} \cdot \frac{E[\exp\{\alpha(Y, Z, C; \eta)\}\, g(Y, C) \mid A = 0, Z, C; \xi]}{E[\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C; \xi]}.$$

We solve equation (4.12) jointly with equations (4.2)-(4.4) with $\gamma$ replaced
by $\hat{\gamma} = (\hat{\eta}^{DR}, \hat{\theta})$. The choice of $h_1$, $h_2$, $g$ and $\omega$ can be decided as in Sections
4.1 and 4.2.


Proposition 4. Under (IV.1)-(IV.2) and condition 1, $\hat{\eta}^{DR}$ and $\hat{\psi}^{DR}$ are
consistent in the union model $\mathcal{M}_a \cup \mathcal{M}_y$, where $\hat{\psi}^{DR} = \mathbb{P}_n Q_g(Y, A, Z, C; \hat{\gamma}, \hat{\xi}) / \widehat{\Pr}(A = 1)$
and $g(Y, C) = Y$.

Proposition 4 implies that $\hat{\eta}^{DR}$ and $\hat{\psi}^{DR}$ are both DR estimators since
their consistency only requires either the extended propensity score or the
outcome regression model to be correctly specified but not necessarily both.
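
The double robustness of $Q_g$ in (4.11) can be checked numerically. The sketch below is a simplified illustration under assumed hypothetical values: $Y$ and $Z$ are binary, there are no covariates, and the selection bias parameter $\eta$ is held at its true value rather than estimated through (4.12). It evaluates $\mathbb{P}_n Q_g / \widehat{\Pr}(A = 1)$ with $g(Y) = Y$ under a deliberately wrong outcome model, a deliberately wrong baseline propensity score, and both.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical truth: logit pi(Y0, Z) = beta(Z) + eta * Y0 with beta(Z) = -0.5 + 1.5 * Z.
eta = -0.6
Z = rng.binomial(1, 0.5, n)
Y0 = rng.binomial(1, 0.4, n)
Y1 = rng.binomial(1, 0.5, n)
A = rng.binomial(1, expit(-0.5 + 1.5 * Z + eta * Y0))
Y = A * Y1 + (1 - A) * Y0

def psi_dr(beta, p_untreated):
    """P_n Q_g / Pr(A = 1) from (4.11) with g(Y) = Y and alpha(Y0) = eta * Y0.

    beta[z]        : working baseline log odds of treatment at Y0 = 0
    p_untreated[z] : working Pr(Y = 1 | A = 0, Z = z)
    """
    odds = np.exp(beta[Z] + eta * Y)                         # pi / (1 - pi)
    p = p_untreated[Z]
    ratio = np.exp(eta) * p / (np.exp(eta) * p + 1.0 - p)    # tilted mean from (4.6)
    q = (1 - A) * odds * (Y - ratio) + A * ratio
    return q.mean() / A.mean()

beta_true = np.array([-0.5, 1.0])
beta_wrong = np.array([0.0, 0.0])                            # ignores Z entirely
p_fitted = np.array([Y[(A == 0) & (Z == z)].mean() for z in (0, 1)])
p_wrong = np.array([0.2, 0.7])                               # arbitrary incorrect values

psi_benchmark = Y0[A == 1].mean()                            # known only in a simulation
psi_only_pi = psi_dr(beta_true, p_wrong)
psi_only_or = psi_dr(beta_wrong, p_fitted)
psi_neither = psi_dr(beta_wrong, p_wrong)
print(psi_benchmark, psi_only_pi, psi_only_or, psi_neither)
```

In this setup the first two estimates should be close to the benchmark, while the third, which uses two wrong working models, is not protected and is visibly biased.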
4.3. Local efficiency. The large sample variance of doubly robust estimators
$\hat{\eta}^{DR}$ and $\hat{\psi}^{DR}$ at the intersection submodel $\mathcal{M}_a \cap \mathcal{M}_y$, where all models
are correctly specified, is determined by the choice of $g(Y, C)$ and $\omega(Z, C)$
in equation (4.12). In the Supplementary Materials, we derive the semiparametric
efficient score of $(\eta, \psi)$ in a model $\mathcal{M}_{np}$ that only assumes that $Z$ is a
valid IV and the selection bias function $\alpha(Y_0, Z, C; \eta)$ is correctly specified.
As discussed in the Supplementary Materials, the efficient score is generally
not available in closed form, except in special cases, such as when $Z$ and $Y$
are both polytomous. Next, we illustrate the result by constructing a locally
efficient estimator of $(\eta, \psi)$ when $Z$ and $Y$ are both binary. In this vein,
similar to the definition of $\tilde{Q}_g(Y, A, Z, C; \gamma, \xi)$, define

$$Q_v(Y, A, Z, C; \gamma, \xi) = \frac{(1 - A)\, v(Y, Z, C)}{1 - \pi(Y, Z, C; \gamma)} + \frac{A - \pi(Y, Z, C; \gamma)}{1 - \pi(Y, Z, C; \gamma)} \cdot \frac{E[\exp\{\alpha(Y, Z, C; \eta)\}\, v(Y, Z, C) \mid A = 0, Z, C; \xi]}{E[\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C; \xi]},$$

where $v$ is any function of $(Y_0, Z, C)$.

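A one-step locally efficient estimator of $\eta$ in $\mathcal{M}_{np}$ is given by

$$\hat{\eta}^{eff} = \hat{\eta}^{DR} - [\ldots],$$

where $v(Y, Z, C) = \{Y - E(Y \mid C)\}\{Z - E(Z \mid C)\}$, $\Lambda(\eta) = Q_v(Y, A, Z, C; \gamma, \xi)$
and $[\ldots] = E\{\Lambda(\eta) \Lambda(\eta)^T \mid C; \hat{\gamma}, \hat{\xi}\}^{-1} E\{\partial \Lambda(\eta) / \partial \eta^T \mid C; \hat{\gamma}, \hat{\xi}\}\, \Lambda(\hat{\eta}^{eff})$ is the
efficient score of $\eta$ evaluated at the estimated intersection submodel $\mathcal{M}_a \cap \mathcal{M}_y$.
Further, let $\hat{\psi}^{DR}(\hat{\eta}^{eff})$ denote a DR estimator for $\psi$ evaluated at the
estimated intersection submodel $\mathcal{M}_a \cap \mathcal{M}_y$ with $\hat{\eta}^{eff}$ substituted in for $\hat{\eta}^{DR}$.
Then the efficient estimator of $\psi$ is given by

$$\hat{\psi}^{eff} = \hat{\psi}^{DR}(\hat{\eta}^{eff}) - E\{\Lambda^2(\hat{\eta}^{eff}) \mid C; \hat{\gamma}, \hat{\xi}\}^{-1} E\{\hat{\psi}^{DR}(\hat{\eta}^{eff})\, \Lambda(\hat{\eta}^{eff}) \mid C; \hat{\gamma}, \hat{\xi}\}\, \Lambda(\hat{\eta}^{eff}).$$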

5. Simulations. Simulations for both binary and continuous outcomes
were conducted to evaluate the finite sample performance of the causal effect
estimators derived in Sections 4.1 and 4.2. Let $\mathcal{M}_a^c$ denote the complement
space of $\mathcal{M}_a$ and likewise define $\mathcal{M}_y^c$. Simulations were conducted under
three scenarios: (i) $\mathcal{M}_a \cap \mathcal{M}_y$, that is both outcome regression and extended
propensity score are correctly specified, (ii) $\mathcal{M}_a \cap \mathcal{M}_y^c$, that is only the extended
propensity score is correctly specified and (iii) $\mathcal{M}_a^c \cap \mathcal{M}_y$, that is only
the outcome regression model is correctly specified.

Simulations were first carried out for a binary outcome. For scenario (i),
the simulation study was conducted in the following steps:
Step 1: A hypothetical study population of size $n$ was generated and each individual
had baseline covariates $C_1$ and $C_2$ generated independently from
Bernoulli distributions with probability 0.4 and 0.6 respectively. Then
the IV $Z$ was generated from the model $\operatorname{logit} \Pr(Z = 1 \mid C) = 0.2 + 0.4 C_1 - 0.5 C_2$
and potential outcomes $Y_0, Y_1$ from models $\operatorname{logit} \Pr(Y_0 = 1 \mid Z, C) = 0.6 + 0.8 C_1 - 2 C_2$
and $\operatorname{logit} \Pr(Y_1 = 1 \mid Z, C) = 0.7 - 0.3 C_1$. The treatment variable $A$ was generated from
$\operatorname{logit} \Pr(A = 1 \mid Y_0, Z, C) = 0.4 + 2Z + 0.8 C_1 - 0.6 Y_0 - 1.6 C_1 Z$, and the observed
outcome was $Y = Y_0 (1 - A) + Y_1 A$.

Step 2: The following extended propensity score model was estimated and the
parameters $\gamma = (\theta_1, \theta_2, \theta_3, \theta_4, \eta)$ in the model

(5.1) $$\operatorname{logit} \Pr(A = 1 \mid Y_0, Z, C; \gamma) = \theta_1 + \theta_2 Z + \theta_3 C_1 + \theta_4 C_1 Z + \eta Y_0$$

were estimated using estimating equations (4.2)-(4.5) with $h_1(Z, C) = (Z, C_1 Z)^T$,
$h_2(C) = C_1$, $t(Y, C) = Y$ and $l(Z, C) = Z$, and $\hat{\psi}^{ipw}$ was
evaluated.

Step 3: The selection bias function was correctly specified as in (4.8), $\xi$ in the
regression outcome model

(5.2) $$\operatorname{logit} E(Y \mid A = 0, Z, C; \xi) = \xi_1 + \xi_2 C_1 + \xi_3 C_2 + \xi_4 Z + \xi_5 C_1 Z$$

was estimated by restricted MLE, and $\alpha$ was estimated by solving
equation (4.7) with $\omega(Z, C) = Z$ and $g(Y, C) = Y$, and $\hat{\psi}^{reg}$ was evaluated.

Step 4: The selection bias function was correctly specified as in (4.8), $\xi$ in
equation (5.2) was estimated by restricted MLE, parameters $\gamma$ in (5.1)
were estimated using (4.2)-(4.4) and (4.12) where $h$, $t$, $l$, $\omega$, $g$ are chosen
as in Step 2 and Step 3, and $\hat{\psi}^{DR}$ was evaluated.

Step 5: Steps 1-4 were repeated 1000 times.
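
The data-generating mechanism of Step 1 can be written out as a short sketch. The coefficients are those stated in Step 1; the function name, the returned dictionary layout and the seed are illustrative choices of ours, and the potential outcomes are returned only so that the target $\psi$ can be checked in-sample.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_binary_study(n, rng):
    """Draw one data set following Step 1; Y0 and Y1 are returned only for checking."""
    C1 = rng.binomial(1, 0.4, n)
    C2 = rng.binomial(1, 0.6, n)
    Z = rng.binomial(1, expit(0.2 + 0.4 * C1 - 0.5 * C2))
    Y0 = rng.binomial(1, expit(0.6 + 0.8 * C1 - 2.0 * C2))
    Y1 = rng.binomial(1, expit(0.7 - 0.3 * C1))
    A = rng.binomial(1, expit(0.4 + 2.0 * Z + 0.8 * C1 - 0.6 * Y0 - 1.6 * C1 * Z))
    Y = Y0 * (1 - A) + Y1 * A
    return {"C1": C1, "C2": C2, "Z": Z, "A": A, "Y": Y, "Y0": Y0, "Y1": Y1}

rng = np.random.default_rng(2024)
data = simulate_binary_study(5000, rng)
psi_in_sample = data["Y0"][data["A"] == 1].mean()   # in-sample analogue of E(Y0 | A = 1)
print(psi_in_sample)
```

An analyst would only see `C1`, `C2`, `Z`, `A` and `Y`; repeating the draw and the estimation steps many times gives the Monte Carlo distribution summarized in the results.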

The data generating mechanism described in Step 1 satisfies the assumptions
(IV.1)-(IV.2) for both $a = 0, 1$. As shown in example 2, $\psi$ is identified
from the observed data since the treatment mechanism is a separable logit
model. Also in the Supplementary Materials, we verify that model (5.2) for
$E(Y \mid A = 0, C, Z)$ contains the true data generating mechanism. Simulations
for scenario (ii) were similar to scenario (i) except that (5.1) was replaced
with

(5.3) $$\operatorname{logit} \Pr(A = 1 \mid Y_0, Z, C; \gamma) = \theta_1 + \theta_2 Z + \theta_3 C_1 + \eta Y_0,$$




LIU ET AL. 


Fig 1: Performance of the IPW, OR and DR estimators of $\psi$ with binary
outcomes. Panels: (a) both outcome regression and extended propensity score are
correctly specified; (b) only the extended propensity score is correctly specified;
(c) only the outcome model is correctly specified.

Note: In each boxplot, the true value $\psi_0$ is marked by the horizontal lines, white boxes
are for $n = 1000$ and grey boxes are for $n = 5000$.
Table 1: Empirical coverage rates based on 95% Wald confidence intervals
for both binary and continuous outcomes

                               Binary Y          Cont. Y
sample size (n)              1000    5000      1000    5000

(i) both $\pi$ and the outcome regression are correct
  $\hat{\psi}^{ipw}$         0.86    0.90      0.96    0.95
  $\hat{\psi}^{reg}$         0.84    0.92      0.97    0.95
  $\hat{\psi}^{DR}$          0.85    0.91      0.97    0.96

(ii) only $\pi$ is correct
  $\hat{\psi}^{ipw}$         0.86    0.90      0.96    0.95
  $\hat{\psi}^{reg}$         0.79    0.60      0.39    0.00
  $\hat{\psi}^{DR}$          0.86    0.91      0.97    0.95

(iii) only the outcome regression is correct
  $\hat{\psi}^{ipw}$         0.78    0.53      0.39    0.00
  $\hat{\psi}^{reg}$         0.84    0.92      0.97    0.95
  $\hat{\psi}^{DR}$          0.85    0.92      0.96    0.96

The coverage was evaluated under three scenarios: (i) both outcome regression and the
extended propensity score are correctly specified, (ii) only the extended propensity
score is correct and (iii) only the outcome regression model is correct.



which is misspecified if $4 / 0 in equation (5.1). For scenario (iii), the 
potential outcome model (5.2) was replaced with 

(5.4) logit E(Y\A = 0, Z, C\ 0=6 + &Ci + 6 Z, 

which is misspecified if 6 / 0 and 6/0 in equation (5.2). We use the R 
package BB (Varadhan and Gilbert, 2009) to solve the nonlinear estimating equations. Simulation results for 1000 Monte Carlo samples are reported in Figure 1 and empirical coverage rates are presented in Table 1. Under correct model specification, all estimators have negligible bias, which diminishes with increasing sample size. In agreement with our theoretical results, the IPW and regression estimators are biased with poor empirical coverage when the extended propensity score or the outcome model is mis-specified, respectively. The DR estimator performs well in terms of bias and coverage when either model is mis-specified but the other is correct. When all models are correctly specified, the relative efficiencies of the locally semiparametric efficient estimator compared to the DR estimator of $\eta$ and $\psi$ are 0.840 and 0.810 respectively, based on Monte Carlo standard errors at sample size n = 5000. This shows that substantial efficiency gain may be possible at the intersection submodel when using the locally efficient score.
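The empirical coverage rates in Table 1 are the share of Monte Carlo replicates whose 95% Wald interval contains the true value. A minimal Python sketch of that bookkeeping follows; the function name `wald_coverage` and the toy normal draws are illustrative assumptions, not the simulation code used for the paper (which relied on R).

```python
import numpy as np

def wald_coverage(estimates, std_errors, truth, z=1.959964):
    """Share of replicates whose Wald interval est +/- z * se contains `truth`."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    return float(np.mean((lower <= truth) & (truth <= upper)))

# Toy illustration (hypothetical numbers): an unbiased estimator whose
# standard error is known exactly should cover close to 95% of the time.
rng = np.random.default_rng(0)
truth, se = 0.5, 0.02
draws = rng.normal(truth, se, size=1000)
coverage = wald_coverage(draws, np.full(1000, se), truth)
```

A biased estimator shifts the interval centres away from the truth, which is why coverage in the misspecified rows of Table 1 falls as n grows and the intervals narrow.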
Simulations for a continuous outcome were conducted similarly to those for the binary outcome, in the following steps.

Step 1*: Covariates $C_1$ and $C_2$ were generated as in Step 1; $Z$ was generated from the model $\operatorname{logit} \Pr(Z = 1 \mid C) = 0.7 + 0.8 C_1 - C_2$; $Y_0$ and $Y_1$ were generated from the models $Y_0 \mid Z, C \sim N(0.5 + C_1 + 3 C_2, 1)$ and $Y_1 \mid Z, C \sim N(1.1 - 1.3 C_1, 1)$; $A$ was generated from $\operatorname{logit} \Pr(A = 1 \mid Y_0, Z, C) = -0.2 - 3Z - 3C_1 + 0.3 Y_0 + 4 C_1 Z$; and $Y = Y_0(1 - A) + Y_1 A$.

Step 2*: Same as Step 2.

Step 3*: Same as Step 3 except the following outcome regression models were fit to the data:

(5.5) $E\{Y \exp(\eta Y) \mid A = 0, Z, C; \xi\} = \xi_1 + \xi_2 C_1 + \xi_3 C_2 + \xi_4 Z + \xi_5 C_1 Z + \xi_6 C_2 Z + \xi_7 C_1 C_2 + \xi_8 C_1 C_2 Z,$

(5.6) $E\{\exp(\eta Y) \mid A = 0, Z, C; \xi\} = \xi_9 + \xi_{10} C_1 + \xi_{11} C_2 + \xi_{12} Z + \xi_{13} C_1 Z + \xi_{14} C_2 Z + \xi_{15} C_1 C_2 + \xi_{16} C_1 C_2 Z.$

Step 4*: Same as Step 4 except that (5.2) was replaced by (5.5) and (5.6).

Step 5*: Same as Step 5.
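The data-generating mechanism of Step 1* can be sketched in Python as below. This is an illustration only: Step 1 is not reproduced in this part of the text, so the covariate distribution (independent standard normals) is an assumption, and the function name `simulate_continuous` is hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_continuous(n, rng):
    """Draw one data set following Step 1* for the continuous outcome."""
    # ASSUMPTION: C1 and C2 are independent standard normals; the actual
    # covariate law is given in Step 1, which is not shown here.
    c1 = rng.normal(size=n)
    c2 = rng.normal(size=n)
    # Instrument: logit Pr(Z = 1 | C) = 0.7 + 0.8 C1 - C2
    z = rng.binomial(1, expit(0.7 + 0.8 * c1 - c2))
    # Potential outcomes, independent of Z given C
    y0 = rng.normal(0.5 + c1 + 3.0 * c2, 1.0)
    y1 = rng.normal(1.1 - 1.3 * c1, 1.0)
    # Extended propensity score depends on Y0 (unmeasured confounding)
    lin = -0.2 - 3.0 * z - 3.0 * c1 + 0.3 * y0 + 4.0 * c1 * z
    a = rng.binomial(1, expit(lin))
    # Observed outcome
    y = y0 * (1 - a) + y1 * a
    return {"c1": c1, "c2": c2, "z": z, "a": a, "y": y, "y0": y0, "y1": y1}

data = simulate_continuous(5000, np.random.default_rng(1))
# Monte Carlo approximation of psi = E(Y0 | A = 1) under the assumed covariate law
psi_mc = data["y0"][data["a"] == 1].mean()
```

Because $Y_0$ is retained in the simulated data, the target $\psi_0$ can be approximated directly from a large sample, which is one way the horizontal reference line in the boxplots could be obtained.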

Simulation for a continuous outcome under scenario (ii) was carried out similarly to that for scenario (i), except that (5.1) was replaced by (5.3). For scenario (iii), the potential outcome models (5.5) and (5.6) were replaced with the linear models

(5.7) $E\{Y \exp(\eta Y) \mid A = 0, Z, C; \xi\} = \xi_1 + \xi_2 C_1 + \xi_4 Z,$

(5.8) $E\{\exp(\eta Y) \mid A = 0, Z, C; \xi\} = \xi_9 + \xi_{10} C_1 + \xi_{12} Z.$

We use the R package nleqslv (Hasselman, 2014) to solve the nonlinear estimating equations.

We verify in Example 4 of the Supplementary Materials that $\psi$ is identified from the observed data. The simulation results for 1000 Monte Carlo samples are reported in Figure 2 and empirical coverage rates are presented in Table 1. Results are similar to those for the binary outcome. Under correct model specification, all estimators have negligible bias, which diminishes with increasing sample size. The IPW and OR estimators are biased with poor empirical coverage when the corresponding model is mis-specified. The DR estimator performs well in terms of bias and coverage when either the extended propensity score or the outcome regression model is correctly specified.
Fig 2: Performance of the IPW, OR and DR estimators of $\psi$ with continuous outcomes

[Figure: boxplots of the IPW, REG and DR estimates of $\psi$, in three panels]
(a) Both models are correctly specified
(b) Only the extended propensity score is correctly specified
(c) Only the outcome model is correctly specified

Note: In each boxplot, the true value $\psi_0$ is marked by the horizontal lines, white boxes are for n = 1000 and grey boxes are for n = 5000.
6. Application. Since the 1980s, tax-deferred programs such as Individual Retirement Accounts (IRAs) and the 401(k) plan have played an important role as a channel for personal savings in the United States. Aiming to encourage investment for future retirement, the 401(k) plan offers tax deductions on deposits into retirement accounts and tax-free accrual of interest. The 401(k) plan shares similarities with IRAs in that both are deferred compensation plans for wage earners, but the 401(k) plan is only provided by employers. The study includes 9275 people. Once offered the 401(k) plan, individuals decide whether to participate in the program. However, participants usually have a stronger preference for savings, which suggests the presence of selection bias. This was addressed as individual heterogeneity by Abadie (2003), and it has been pointed out that a simple comparison of personal savings between participants and non-participants may yield results that are biased upward. It was also postulated that, given income, 401(k) eligibility is unrelated to individual preferences for savings and thus can be used as an instrument for participation in the 401(k) program (Poterba and Venti, 1994; Poterba, Venti and Wise, 1995). The complier causal effect for the 401(k) plan was studied by Abadie (2003). Here, we reanalyze these data to illustrate the proposed estimators of the marginal ETT.

We illustrate the methods in the context of a dichotomous outcome defined as the indicator that a person's net savings lie above the first quartile of net savings in the observed sample (equal to -500 US dollars). The treatment variable is a binary indicator of participation in a 401(k) plan and the IV is a binary indicator of 401(k) eligibility. The covariates are standardized log family income ($\log_{10}(\text{income}) - 4.5$), standardized age (age - 41) and its square, marital status and family size. Age ranges from 25 to 64 years, marital status is a binary indicator variable and family size ranges from 1 to 13 people. These covariates are thought to be associated with unobserved preferences for savings. Let $\psi = E(Y_0 \mid A = 1)$ denote, for a family that actually participated in the 401(k) program, the probability that they would have had net financial assets above the first quartile had they, possibly contrary to fact, been forced not to participate in the program. The ETT $= E(Y_1 - Y_0 \mid A = 1)$ is the effect of the 401(k) plan on the difference scale for the probability of family net financial assets above the first quartile among participants. Equivalently, the ETT can be interpreted as the effect of the intervention in reducing a person's risk for poor savings performance, as measured by falling below the first quartile of the empirical distribution of savings for the sample. Before implementing our IV estimators, we first obtained a standard IPW estimator of the ETT under an assumption of no unmeasured confounding, i.e. $\hat\psi^{ipw}_0$, defined as $\hat\psi^{ipw}$ with $\alpha = 0$. Thus, the extended propensity score was modeled as:

logit Pr(A = 1 | Z, C) = 1 + Z + log(income) + married + age + fsize + age^2,

and estimated by standard maximum likelihood. The IPW estimate of $\psi$ was $\hat\psi^{ipw}_0 = 0.688$ with standard error (se) 0.014, where the se was evaluated using the sandwich estimator accounting for all sources of variability. In comparison, the estimator based on the empirical estimate of $E(Y \mid A = 1)$ was 0.883 (se = 0.006). Thus an estimate of the ETT was $\widehat{\mathrm{ETT}} = 0.194$ (se = 0.016), which suggests the 401(k) plan may have a significant effect on increasing family net financial assets among participants.
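The estimator with $\alpha = 0$ reweights untreated outcomes by the odds of treatment and then subtracts the result from the treated mean. The Python sketch below illustrates that computation under stated assumptions: the propensity values are taken as already fitted (for instance by a logistic regression on $Z$ and $C$), the function names `ipw_psi` and `ett` are hypothetical, and no sandwich standard error is computed.

```python
import numpy as np

def ipw_psi(y, a, propensity):
    """IPW estimate of psi = E(Y0 | A = 1) when the propensity score
    does not depend on Y0 (alpha = 0): untreated outcomes are weighted
    by p / (1 - p) and normalised by the number treated."""
    y, a, p = (np.asarray(v, dtype=float) for v in (y, a, propensity))
    weights = (1.0 - a) * p / (1.0 - p)
    return float(np.sum(weights * y) / np.sum(a))

def ett(y, a, propensity):
    """Empirical E(Y | A = 1) minus the IPW estimate of E(Y0 | A = 1)."""
    y_arr = np.asarray(y, dtype=float)
    a_arr = np.asarray(a, dtype=float)
    treated_mean = float(np.sum(a_arr * y_arr) / np.sum(a_arr))
    return treated_mean - ipw_psi(y, a, propensity)

# Tiny hand-checkable example with a constant (hypothetical) propensity of 0.5
y_toy = [1, 0, 1, 0]
a_toy = [1, 1, 0, 0]
p_toy = [0.5, 0.5, 0.5, 0.5]
psi_toy = ipw_psi(y_toy, a_toy, p_toy)
ett_toy = ett(y_toy, a_toy, p_toy)
```

With the figures reported above, the same subtraction gives 0.883 - 0.688, which agrees with the reported 0.194 up to rounding of the two inputs.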

However, this result may be spurious because, even after controlling for observed covariates, there may still exist unmeasured factors that confound the relationship between the 401(k) plan and family net financial assets. Under assumptions (IV.1)-(IV.2) and condition 1, we applied the methods proposed in Section 4 to estimate the ETT in the presence of unmeasured confounders. The following parametric models were considered:

logit Pr(Z = 1 | C) = 1 + log(income) + married + age + fsize + age^2,

logit Pr(Y = 1 | A = 0, Z, C) = 1 + Z + log(income) + married + age + fsize + age^2.

We specified the selection bias function as in (4.8), thus the selection bias function was assumed to depend on $Y_0$ linearly. Possible deviations from this simple model were explored by allowing for potential interactions of $Y_0$ with observed covariates in the extended propensity score. Thus, we posited the following parametric model for the extended propensity score, which satisfies identifying condition 1 as a submodel of the separable model:

logit Pr(A = 1 | Y_0, Z, C) = 1 + Z + Y_0 + log(income) + married + age + fsize + age^2.

Table 2 reports point estimates and estimated standard errors for the IV, extended propensity score and outcome regression models. Although the DR estimator also involves an outcome regression model among the unexposed, it is the same model as required for the regression estimator, thus these estimates are reported only once. The instrument is strongly associated with family income (logOR = 2.823, se = 0.106), age (logOR = 0.007, se = 0.002) and age squared (logOR = -0.002, se = 2e-4). The selection bias parameter was estimated to be 0.320 (se = 0.115) by IPW, 0.385 (se = 0.135) by OR and 0.280 (se = 0.101) by DR estimation. This provides strong evidence that unmeasured confounding may be present, and that the stronger one's preference for saving, the more likely one is to participate in the 401(k) plan. All three estimators of the marginal ETT also agree with each other: they are significant but with a smaller Z-score than when the selection bias is ignored (for example, the IPW estimator suggests $\widehat{\mathrm{ETT}} = 0.134$, se = 0.013). The efficient estimator for the selection bias parameter is 0.273 and for the ETT is 0.137, both in agreement with the other three estimators. Thus we may conclude that even after adjustment for unobserved preferences for savings, the 401(k) plan still can increase net financial assets among participants.

These findings roughly agree with the results obtained by Abadie in the sense that the IV estimate corrects the observational estimate towards the null. However, it may be difficult to directly compare our findings to those of Abadie, who reported the complier average treatment effect under a monotonicity assumption on the IV-exposure relationship, and assuming no unmeasured confounding of this first-stage relation. Our approaches rely on neither assumption, but instead rely on condition 1, encoded in the functional form of the extended propensity score model, for identification. In order to assess the robustness of the selection bias model, additional functional forms were explored. We considered adding to $\alpha$ an interaction between $Y_0$ and each of the covariates: log income, marital status and family size. There was no evidence in favor of any such interaction.

Table 2: Point estimates and estimated se [in brackets] of the IPW, OR and DR estimators for the ETT of the 401(k) plan, as well as the parameters for the IV, extended propensity score and outcome regression models required by those estimators

                         IV model          IPW propensity    regression        DR propensity
Intercept                -0.180 [0.058]    -8.685 [1.832]     1.307 [0.073]    -8.629 [1.796]
linc                      2.695 [0.107]     1.626 [0.210]     0.618 [0.128]     1.633 [0.209]
age                       0.007 [0.002]    -0.009 [0.005]     0.035 [0.003]    -0.009 [0.005]
fsize                    -0.037 [0.019]    -0.004 [0.033]    -0.127 [0.022]    -0.005 [0.033]
marr                     -0.145 [0.063]    -0.032 [0.108]    -0.133 [0.075]    -0.031 [0.108]
age^2                    -0.002 [2e-04]     0.001 [4e-04]     6e-04 [3e-04]     0.001 [4e-04]
Z                                           9.150 [1.820]    -0.210 [0.074]     9.126 [1.781]
$\alpha$                                    0.320 [0.115]     0.385 [0.135]     0.280 [0.101]
$\psi = E(Y_0 | A = 1)$                     0.749 [0.012]     0.746 [0.012]     0.750 [0.012]
ETT                                         0.134 [0.013]     0.137 [0.014]     0.132 [0.014]


7. Discussion. In this paper, we establish that access to an IV allows for identification of an association between exposure to the treatment and the potential outcome when unexposed, which directly encodes the magnitude of selection bias into treatment due to confounding. We propose IPW, OR as well as DR estimators for the treatment effect amongst treated individuals. Vansteelandt and Goetghebeur (2003) and Robins (1994) proposed identification and inference approaches under a no-current-treatment-value interaction assumption, thus their estimators remain consistent under the null hypothesis of no ETT. In contrast, the identification and inference approaches we propose may be particularly valuable when an ITT analysis indicates a non-null treatment effect and thus Robins' identification assumption of no current treatment value interaction may be violated.

The proposed methods assume the treatment is binary. They can be generalized without much effort to categorical treatment. However, when the treatment is continuous (for example, A is treatment dose), a parametric model for the treatment effect as well as a model for the density of A may be unavoidable for estimation. We leave this as a topic for future research.
Appendix A contains proofs of the propositions. Appendix B presents proofs of the examples in the main text and further examples on identification of the models. Appendix C presents additional derivations mentioned in the main text. Appendix D presents derivations of semiparametric efficiency theory.


APPENDIX A: PROOFS OF PROPOSITIONS 


Proof of Proposition 1 

We prove by contradiction. Suppose we have two candidates $\Pr_1(A, Y_0, Z, C)$ and $\Pr_2(A, Y_0, Z, C)$ satisfying the same observed density:

$$\Pr\nolimits_1(A, Y_0, Z, C) = \Pr\nolimits_2(A, Y_0, Z, C).$$

By assumption (IV.2), we have the decomposition for the joint distribution:

$$f_j(A, Y_0, Z, C) = f_j(C) f_j(Z \mid C) f_j(Y_0 \mid C) f_j(A \mid Y_0, Z, C) \quad \text{for } j = 1, 2.$$

Since $f(C)$ and $f(Z \mid C)$ can be identified from the observed data, we have $f_1(C) = f_2(C)$ and $f_1(Z \mid C) = f_2(Z \mid C)$. Thus,

$$f_1(Y_0 \mid C) \Pr\nolimits_1(A = 0 \mid Y_0, Z, C) = f_2(Y_0 \mid C) \Pr\nolimits_2(A = 0 \mid Y_0, Z, C),$$

and equivalently

$$\frac{\Pr\nolimits_1(A = 0 \mid Y_0, Z, C)}{\Pr\nolimits_2(A = 0 \mid Y_0, Z, C)} = \frac{f_2(Y_0 \mid C)}{f_1(Y_0 \mid C)}.$$

This equation contradicts the condition, which requires the two ratios to be unequal. Thus, the requirement that the ratios are not equal is equivalent to the impossibility of two sets of candidates satisfying the same observed quantities, i.e. the identifiability of the joint distribution.

Proof of Proposition 2 

We first prove equation (4.1). Note that 



E(Y 0 \A = 1) 
Thus, equation (4.1) is proved.

We show that if $\pi(Y_0, Z, C)$ is correctly specified, the equations (4.2)-(4.5) hold at the true value $\gamma$, thus they are indeed unbiased estimating equations for $\gamma$. The equality is easy to show for (4.2)-(4.4) by the law of iterated expectations. For (4.5), the assumptions (IV.1)-(IV.2) imply $Y_0 \perp Z \mid C$, thus

$$\begin{aligned}
& E\left[ \frac{1 - A}{1 - \pi(Y, Z, C)} t(Y, C)\{l(Z, C) - E(l(Z, C) \mid C)\} \right] \\
&= E\left[ \frac{1 - A}{1 - \pi(Y_0, Z, C)} t(Y_0, C)\{l(Z, C) - E(l(Z, C) \mid C)\} \right] \\
&= E[t(Y_0, C)\{l(Z, C) - E(l(Z, C) \mid C)\}] \\
&= E[E(t(Y_0, C) \mid C)\{E(l(Z, C) \mid C) - E(l(Z, C) \mid C)\}] \\
&= 0.
\end{aligned}$$

Thus, by equation (4.1), $\hat\psi^{ipw}$ is consistent for $\psi$.

Assume

(A.1)
$$E\left[ \frac{\partial}{\partial \gamma^{T}} \left\{ \frac{1 - A}{1 - \pi(Y_0, Z, C; \gamma)} \begin{pmatrix} 1 \\ h_1(Z, C) - E(h_1(Z, C) \mid C) \\ h_2(C) - E(h_2(C)) \\ t(Y, C)\{l(Z, C) - E(l(Z, C) \mid C)\} \end{pmatrix} \right\} \right] \text{ is invertible.}$$

Condition (A.1) is sufficient for local uniqueness of the nuisance parameter estimates obtained from equations (4.2)-(4.5), and thus $\psi$ is identified from the observed data.

Proof of Proposition 3 

We first prove equation (4.6). Note that 
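$$\begin{aligned}
& \frac{E[\exp\{\alpha(Y, Z, C)\} g(Y, C) \mid A = 0, Z, C]}{E[\exp\{\alpha(Y, Z, C)\} \mid A = 0, Z, C]} \\
&= \frac{E[\exp\{\alpha(Y_0, Z, C)\} g(Y_0, C) \mid A = 0, Z, C]}{E[\exp\{\alpha(Y_0, Z, C)\} \mid A = 0, Z, C]} \\
&= E\left[ \frac{f(Y_0 \mid A = 1, Z, C) f(Y_0 = 0 \mid A = 0, Z, C)}{f(Y_0 \mid A = 0, Z, C) f(Y_0 = 0 \mid A = 1, Z, C)} g(Y, C) \,\Big|\, A = 0, Z, C \right] \Big/ E\left[ \frac{f(Y_0 \mid A = 1, Z, C) f(Y_0 = 0 \mid A = 0, Z, C)}{f(Y_0 \mid A = 0, Z, C) f(Y_0 = 0 \mid A = 1, Z, C)} \,\Big|\, A = 0, Z, C \right] \\
&= E\left[ \frac{f(Y_0 \mid A = 1, Z, C)}{f(Y_0 \mid A = 0, Z, C)} g(Y, C) \,\Big|\, A = 0, Z, C \right] \Big/ E\left[ \frac{f(Y_0 \mid A = 1, Z, C)}{f(Y_0 \mid A = 0, Z, C)} \,\Big|\, A = 0, Z, C \right] \\
&= E(g(Y_0, C) \mid A = 1, Z, C) / 1 \\
&= E(g(Y_0, C) \mid A = 1, Z, C).
\end{aligned}$$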

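We then show that equation (4.7) holds at the true values of $\xi$ and $\eta$ and thus is indeed an unbiased estimating equation for $\eta$. Note that by (IV.1)-(IV.2), we have $Y_0 \perp Z \mid C$, thus

$$\begin{aligned}
& E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\}\{A E(g(Y_0, C) \mid A = 1, Z, C) + (1 - A) g(Y, C)\}] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\}\{A g(Y_0, C) + (1 - A) g(Y_0, C)\}] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\} g(Y_0, C)] \\
&= E[\{E(\omega(Z, C) \mid C) - E(\omega(Z, C) \mid C)\} E(g(Y_0, C) \mid C)] \\
&= 0.
\end{aligned}$$

Consistency of the regression estimator $\hat\psi^{reg}$ follows from equation (4.7). Assume

(A.2)
$$E\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C)\} A \frac{\partial}{\partial \eta} \left\{ \frac{E(\exp\{\alpha(Y, Z, C; \eta)\} g(Y, C) \mid A = 0, Z, C)}{E(\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C)} \right\} \right] \text{ is invertible.}$$

Condition (A.2) is sufficient for local uniqueness of an estimator for $\eta$ obtained from equation (4.7). To see the relationship between (A.2) and the first order derivative of (4.7), note that

$$\begin{aligned}
& \frac{\partial}{\partial \eta} E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\}\{A E(g(Y_0, C) \mid A = 1, Z, C; \eta) + (1 - A) g(Y, C)\}] \\
&= E\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C)\} A \frac{\partial}{\partial \eta} E(g(Y_0, C) \mid A = 1, Z, C; \eta) \right] \\
&= E\left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C)\} A \frac{\partial}{\partial \eta} \left\{ \frac{E(\exp\{\alpha(Y, Z, C; \eta)\} g(Y, C) \mid A = 0, Z, C)}{E(\exp\{\alpha(Y, Z, C; \eta)\} \mid A = 0, Z, C)} \right\} \right].
\end{aligned}$$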
Proof of Proposition 4

We use the superscript * to denote a misspecified model. Otherwise, an expectation or a model is always evaluated at the true value of the parameters. Note that by (IV.1)-(IV.2), we have $Y_0 \perp Z \mid C$. If the parametric models lie in $\mathcal{M}_a$, then $\hat\gamma \xrightarrow{p} \gamma$ and

$$\begin{aligned}
& E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\} Q_g(Y, A, Z, C; \gamma, \xi)] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\} g(Y_0, C)] \\
&= E[\{E(\omega(Z, C) \mid C) - E(\omega(Z, C) \mid C)\} g(Y_0, C)] = 0.
\end{aligned}$$

Additionally,

$$\begin{aligned}
\hat\psi^{DR} &\xrightarrow{p} E\left\{ \frac{\pi(Y_0, Z, C)\{Y_0 - E^*(Y_0 \mid A = 1, Z, C)\}}{\Pr(A = 1)} + \frac{E^*(Y_0 \mid A = 1, Z, C) \pi(Y_0, Z, C)}{\Pr(A = 1)} \right\} \\
&= E\left\{ \frac{\pi(Y_0, Z, C) Y_0}{\Pr(A = 1)} \right\} \\
&= E\left\{ \frac{A Y_0}{\Pr(A = 1)} \right\} \\
&= E(Y_0 \mid A = 1) = \psi.
\end{aligned}$$

Thus, $\hat\eta^{DR}$ and $\hat\psi^{DR}$ are consistent if the parametric models lie in $\mathcal{M}_a$.
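If the parametric models lie in $\mathcal{M}_y$:

$$\begin{aligned}
& E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\} Q_g(Y, A, Z, C; \gamma, \xi)] \\
&= E\Big[ \{\omega(Z, C) - E(\omega(Z, C) \mid C)\} \Big\{ (1 - A) \exp\{\alpha(Y_0, Z, C) + \beta^*(Z, C)\} \Big( g(Y_0, C) - \frac{E(g(Y_0, C) \exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)}{E(\exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)} \Big) \\
&\qquad + A E(g(Y_0, C) \mid A = 1, Z, C) + (1 - A) g(Y_0, C) \Big\} \Big] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\}\{A E(g(Y_0, C) \mid A = 1, Z, C) + (1 - A) g(Y_0, C)\}] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\}\{\Pr(A = 1 \mid Z, C) E(g(Y_0, C) \mid A = 1, Z, C) + \Pr(A = 0 \mid Z, C) E(g(Y_0, C) \mid A = 0, Z, C)\}] \\
&= E[\{\omega(Z, C) - E(\omega(Z, C) \mid C)\} E(g(Y_0, C) \mid Z, C)] \\
&= E[\{E(\omega(Z, C) \mid C) - E(\omega(Z, C) \mid C)\} E(g(Y_0, C) \mid C)] \\
&= 0.
\end{aligned}$$

Also,

$$\begin{aligned}
\hat\psi^{DR} &\xrightarrow{p} E\left[ \frac{1 - A}{\Pr(A = 1)} \frac{\pi^*(Y_0, Z, C)}{1 - \pi^*(Y_0, Z, C)} \left\{ Y_0 - \frac{E(Y_0 \exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)}{E(\exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)} \right\} + \frac{A E(Y_0 \mid A = 1, Z, C)}{\Pr(A = 1)} \right] \\
&= E\left[ \frac{1 - A}{\Pr(A = 1)} \exp\{\alpha(Y_0, Z, C) + \beta^*(Z, C)\} \left\{ Y_0 - \frac{E(Y_0 \exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)}{E(\exp(\alpha(Y_0, Z, C)) \mid A = 0, Z, C)} \right\} \right] + \frac{E(A Y_0)}{\Pr(A = 1)} \\
&= E\left\{ \frac{A Y_0}{\Pr(A = 1)} \right\} \\
&= E(Y_0 \mid A = 1) = \psi.
\end{aligned}$$

Thus, $\hat\eta^{DR}$ and $\hat\psi^{DR}$ are consistent if the parametric models lie in $\mathcal{M}_y$. Therefore, $\hat\eta^{DR}$ and $\hat\psi^{DR}$ are DR for $\eta$ and $\psi$ respectively.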


APPENDIX B: PROOFS FOR EXAMPLES IN SECTION 3 


Proof of example 1 
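Let $\Pr(A = 0 \mid Y_0, Z; \theta_1, \theta_2, \eta_1, \eta_2) = \operatorname{expit}(\theta_1 + \theta_2 Z + \eta_1 Y_0 + \eta_2 Y_0 Z)$ and $\Pr(Y_0 = 1; \tau) = \exp(\tau)$. We show that for any $(\theta_1, \theta_2, \eta_1, \eta_2, \tau)$, there exists $(\tilde\theta_1, \tilde\theta_2, \tilde\eta_1, \tilde\eta_2, \tilde\tau) \neq (\theta_1, \theta_2, \eta_1, \eta_2, \tau)$ such that

(B.1)
$$\Pr(A = 0 \mid Y_0, Z; \theta_1, \theta_2, \eta_1, \eta_2) \Pr(Y_0; \tau) = \Pr(A = 0 \mid Y_0, Z; \tilde\theta_1, \tilde\theta_2, \tilde\eta_1, \tilde\eta_2) \Pr(Y_0; \tilde\tau).$$

Suppose there exists $\rho_1 \neq 0$ such that $\Pr(Y_0 = 0; \tilde\tau) / \Pr(Y_0 = 0; \tau) = \exp(\rho_1)$, thus (B.1) is equivalent to

(B.2)
$$\frac{\Pr(A = 0 \mid Y_0, Z; \theta_1, \theta_2, \eta_1, \eta_2)}{\Pr(A = 0 \mid Y_0, Z; \tilde\theta_1, \tilde\theta_2, \tilde\eta_1, \tilde\eta_2)} = \frac{\Pr(Y_0; \tilde\tau)}{\Pr(Y_0; \tau)} = \exp(\rho_1 + \rho_2 Y_0),$$

where $\rho_2 = \log[\exp(-\rho_1 - \tau) + \{\exp(\tau) - 1\} / \exp(\tau)]$.

Note that two different sets of parameters would lead to the same observed data distribution by properly choosing $\rho_1$ and choosing $\tilde\theta_1 = \theta_1 - \rho_1 - \log \varpi_1$, $\tilde\theta_2 = \theta_2 + \log \varpi_1 - \log \varpi_2$, $\tilde\eta_1 = \eta_1 - \rho_2 + \log \varpi_1 - \log \varpi_3$, $\tilde\eta_2 = \eta_2 + \log \varpi_2 + \log \varpi_3 - \log \varpi_1 - \log \varpi_4$ and $\tilde\tau = \tau + \rho_1 + \rho_2$, where $\varpi_1 = 1 + \exp(\theta_1) - \exp(\theta_1 - \rho_1)$, $\varpi_2 = 1 + \exp(\theta_1 + \theta_2) - \exp(\theta_1 + \theta_2 - \rho_1)$, $\varpi_3 = 1 + \exp(\theta_1 + \eta_1) - \exp(\theta_1 + \eta_1 - \rho_1 - \rho_2)$ and $\varpi_4 = 1 + \exp(\theta_1 + \theta_2 + \eta_1 + \eta_2) - \exp(\theta_1 + \theta_2 + \eta_1 + \eta_2 - \rho_1 - \rho_2)$. For example, choose $\rho_1 = 0.3$, $\rho_2 = -0.38$, $(\theta_1, \theta_2, \eta_1, \eta_2, \tau) = (0.3, 0.6, 0.1, 0.7, -0.2)$ and $(\tilde\theta_1, \tilde\theta_2, \tilde\eta_1, \tilde\eta_2, \tilde\tau) = (-0.3, 0.41, 0.91, 1.37, -0.28)$; they lead to the same observed distribution.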
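The non-identification claim can be checked numerically. The Python sketch below builds the alternative parameter vector from the formulas for $\varpi_1, \ldots, \varpi_4$ and confirms that both parameter sets give the same $\Pr(A = 0, Y_0 = y \mid Z = z)$, which is the part of the law that the data reveal. The function names are hypothetical, and $\rho_2$ is computed from its formula rather than taken as the rounded value -0.38.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def alternative_parameters(theta1, theta2, eta1, eta2, tau, rho1):
    """Second parameter vector that reproduces the same observed law."""
    rho2 = math.log(math.exp(-rho1 - tau) + (math.exp(tau) - 1.0) / math.exp(tau))
    s = theta1 + theta2 + eta1 + eta2
    w1 = 1 + math.exp(theta1) - math.exp(theta1 - rho1)
    w2 = 1 + math.exp(theta1 + theta2) - math.exp(theta1 + theta2 - rho1)
    w3 = 1 + math.exp(theta1 + eta1) - math.exp(theta1 + eta1 - rho1 - rho2)
    w4 = 1 + math.exp(s) - math.exp(s - rho1 - rho2)
    return (
        theta1 - rho1 - math.log(w1),
        theta2 + math.log(w1) - math.log(w2),
        eta1 - rho2 + math.log(w1) - math.log(w3),
        eta2 + math.log(w2) + math.log(w3) - math.log(w1) - math.log(w4),
        tau + rho1 + rho2,
    )

def observed_law(theta1, theta2, eta1, eta2, tau):
    """Pr(A = 0, Y0 = y | Z = z) for binary y and z, with Pr(Y0 = 1) = exp(tau)."""
    p_y = {1: math.exp(tau), 0: 1.0 - math.exp(tau)}
    return {
        (y, z): expit(theta1 + theta2 * z + eta1 * y + eta2 * y * z) * p_y[y]
        for y in (0, 1)
        for z in (0, 1)
    }

original = (0.3, 0.6, 0.1, 0.7, -0.2)
alternative = alternative_parameters(*original, rho1=0.3)
law_a = observed_law(*original)
law_b = observed_law(*alternative)
max_gap = max(abs(law_a[k] - law_b[k]) for k in law_a)
```

Rounded to two decimals, `alternative` matches the vector quoted in the text up to rounding in the last digit, and `max_gap` is zero up to floating-point error.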

Proof of example 2 

The separable treatment mechanism implies $\eta_2 = \tilde\eta_2 = 0$, and thus $\varpi_2 \varpi_3 = \varpi_1 \varpi_4$, i.e. $\{1 + \exp(\theta_1 + \theta_2) - \exp(\theta_1 + \theta_2 - \rho_1)\}\{1 + \exp(\theta_1 + \eta_1) - \exp(\theta_1 + \eta_1 - \rho_1 - \rho_2)\} = \{1 + \exp(\theta_1) - \exp(\theta_1 - \rho_1)\}\{1 + \exp(\theta_1 + \theta_2 + \eta_1) - \exp(\theta_1 + \theta_2 + \eta_1 - \rho_1 - \rho_2)\}$, which indicates

(B.3)
$$\exp(\rho_2) = \frac{\exp(\eta_1)}{1 + \exp(\rho_1 + \eta_1) - \exp(\rho_1)}.$$

Since in (B.2), $\exp(\rho_1 + \rho_2 Y_0)$ is the ratio of two densities for $Y_0$, $\rho_1$ and $\rho_1 + \rho_2$ should be of opposite sign. From equation (B.3), if $\rho_1 > 0$, then $\exp(\rho_1) > 1$ and $\exp(\rho_1 + \rho_2) > 1$. Similarly, if $\rho_1 < 0$, then $\exp(\rho_1) < 1$ and $\exp(\rho_1 + \rho_2) < 1$. Thus, we conclude that $\rho_1 = \rho_2 = 0$, i.e. the separable treatment mechanism is identified for the binary case.

Proof of example 3 

Suppose there exist two densities that make the ratios equal,

(B.4)
$$\frac{\operatorname{expit}\{q_1(Z) + h_1(Y_0)\}}{\operatorname{expit}\{q_2(Z) + h_2(Y_0)\}} = \frac{f_2(Y_0)}{f_1(Y_0)}.$$

We first take derivatives over $Z$ on both sides, and we have

$$\frac{\partial \operatorname{expit}\{q_1(Z) + h_1(Y_0)\} / \partial Z}{\operatorname{expit}\{q_1(Z) + h_1(Y_0)\}} = \frac{\partial \operatorname{expit}\{q_2(Z) + h_2(Y_0)\} / \partial Z}{\operatorname{expit}\{q_2(Z) + h_2(Y_0)\}}.$$

Expanding the expit functions and simplifying the equation yields

(B.5)
$$\frac{\partial q_1(Z) / \partial Z}{\partial q_2(Z) / \partial Z} [1 + \exp\{q_2(Z) + h_2(Y_0)\}] = 1 + \exp\{q_1(Z) + h_1(Y_0)\}.$$

Next, we take derivatives over $Y_0$ on both sides of the above equation, and we have

$$\frac{\partial q_1(Z) / \partial Z}{\partial q_2(Z) / \partial Z} \frac{\partial h_2(Y_0)}{\partial Y_0} \exp\{q_2(Z) + h_2(Y_0)\} = \frac{\partial h_1(Y_0)}{\partial Y_0} \exp\{q_1(Z) + h_1(Y_0)\},$$

which is equivalent to

$$\frac{\partial q_1(Z) / \partial Z}{\partial q_2(Z) / \partial Z} \exp\{q_2(Z) - q_1(Z)\} = \frac{\partial h_1(Y_0) / \partial Y_0}{\partial h_2(Y_0) / \partial Y_0} \exp\{h_1(Y_0) - h_2(Y_0)\}.$$

The left hand side of the above equation is a function of $Z$, but the right hand side is a function of $Y_0$. Thus, we must have

$$\frac{\partial q_1(Z) / \partial Z}{\partial q_2(Z) / \partial Z} \exp\{q_2(Z) - q_1(Z)\} = c_1,$$

for some constant $c_1$. We multiply both sides of equation (B.5) by $\exp\{-q_1(Z)\}$, and we have

$$c_1 [\exp\{-q_2(Z)\} + \exp\{h_2(Y_0)\}] = \exp\{-q_1(Z)\} + \exp\{h_1(Y_0)\},$$

and thus for some constant $c_2$,

$$c_1 \exp\{-q_2(Z)\} + c_2 = \exp\{-q_1(Z)\}, \qquad c_1 \exp\{h_2(Y_0)\} - c_2 = \exp\{h_1(Y_0)\}.$$

We substitute $q_2(Z)$ and $h_2(Y_0)$ in equation (B.4) with the expressions above to obtain

$$\frac{\exp\{h_1(Y_0)\} + c_2}{\exp\{h_1(Y_0)\}} = \frac{f_1(Y_0)}{f_2(Y_0)},$$

and thus

$$\frac{f_1(Y_0)}{f_2(Y_0)} = 1 + c_2 \exp\{-h_1(Y_0)\}.$$

Note that $1 + c_2 \exp\{-h_1(Y_0)\} > 1$ for $c_2 > 0$, and $1 + c_2 \exp\{-h_1(Y_0)\} < 1$ for $c_2 < 0$. This cannot be true for the ratio of two densities. So we must have $c_2 = 0$, and thus $f_1(Y_0) / f_2(Y_0) = 1$. As a result, the joint distribution is identified.

The joint distribution is also identified in the separable treatment mechanisms for a continuous outcome with a binary instrument.

Example 4. Consider the case of a continuous outcome with a binary instrument. Assume the logistic separable treatment mechanism: $\mathcal{P}_{A \mid Y_0, Z} = \{\Pr(A = 0 \mid Y_0, Z) : \operatorname{logit} \Pr(A = 0 \mid Y_0, Z) = \theta Z + h(Y_0)\}$, where $h$ is a known or unknown function. It can be shown that $\mathcal{P}_{A \mid Y_0, Z}$ satisfies condition 1 and thus the joint distribution is identified.


Proof of example 4 

Suppose there exist two sets of densities that make the ratios equal,

(B.6)
$$\frac{\operatorname{expit}\{\theta_1 Z + h_1(Y_0)\}}{\operatorname{expit}\{\theta_2 Z + h_2(Y_0)\}} = \frac{f_2(Y_0)}{f_1(Y_0)}.$$

The above equation holds for both $Z = 0, 1$, so we have

$$\frac{\operatorname{expit}\{h_1(Y_0)\}}{\operatorname{expit}\{h_2(Y_0)\}} = \frac{\operatorname{expit}\{\theta_1 + h_1(Y_0)\}}{\operatorname{expit}\{\theta_2 + h_2(Y_0)\}}.$$

Simplifying the equation, we have

$$\exp\{h_1(Y_0)\} = \frac{\exp(\theta_2) - \exp(\theta_1) + \{\exp(\theta_2) - \exp(\theta_1 + \theta_2)\} \exp\{h_2(Y_0)\}}{\exp(\theta_1) - \exp(\theta_1 + \theta_2)}.$$

Substituting $\exp\{h_1(Y_0)\}$ with the above expression in equation (B.6), we have

$$\frac{f_2(Y_0)}{f_1(Y_0)} = 1 + \frac{\exp(\theta_2) - \exp(\theta_1)}{\exp(\theta_2) - \exp(\theta_1 + \theta_2)} \exp\{-h_2(Y_0)\}.$$

If $\theta_1 \neq \theta_2$, we must have $f_2(Y_0) / f_1(Y_0) < 1$ for any $Y_0$, or $f_2(Y_0) / f_1(Y_0) > 1$ for any $Y_0$. This cannot be true for the ratio of two densities. So we must have $\theta_1 = \theta_2$, and thus $f_1(Y_0) / f_2(Y_0) = 1$. As a result, the joint distribution is identified.


Example 5. Assume the probit separable treatment mechanism: $\mathcal{P}_{A \mid Y_0, Z} = \{\Pr(A = 0 \mid Y_0, Z) : \Pr(A = 0 \mid Y_0, Z) = \Phi\{q(Z) + h(Y_0)\}\}$, where $\Phi$ is the standard normal distribution function, $q$ and $h$ are known or unknown functions, and $q$ is differentiable. Then the joint distribution of $A, Y_0, Z$ is identified.
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Proof of example 5 

Suppose two sets of parameters make the ratio a function of $Y_0$, i.e. for some function $s$,

$$\Phi\{q_1(Z) + h_1(Y_0)\} = \Phi\{q_2(Z) + h_2(Y_0)\} s(Y_0).$$

By taking derivatives over $Z$ on both sides, we have

$$\frac{\partial q_1(Z)}{\partial Z} \phi\{q_1(Z) + h_1(Y_0)\} = \frac{\partial q_2(Z)}{\partial Z} \phi\{q_2(Z) + h_2(Y_0)\} s(Y_0),$$

where $\phi$ is the standard normal density function. Equivalently,

$$\log \frac{\phi\{q_1(Z) + h_1(Y_0)\}}{\phi\{q_2(Z) + h_2(Y_0)\}} = \log \frac{\partial q_2(Z) / \partial Z}{\partial q_1(Z) / \partial Z} + \log s(Y_0),$$

which implies that

(B.7)
$$\{q_2(Z) + h_2(Y_0)\}^2 - \{q_1(Z) + h_1(Y_0)\}^2 = 2 \left\{ \log \frac{\partial q_2(Z) / \partial Z}{\partial q_1(Z) / \partial Z} + \log s(Y_0) \right\}.$$

Since the right hand side does not include an interaction term of $Z$ and $Y_0$, we have

$$q_1(Z) h_1(Y_0) = q_2(Z) h_2(Y_0),$$

and thus

$$\frac{q_1(Z)}{q_2(Z)} = \frac{h_2(Y_0)}{h_1(Y_0)}.$$

Hence $q_1(Z) = c q_2(Z)$ and $h_2(Y_0) = c h_1(Y_0)$ for some positive constant $c$. Substituting $q_2$ and $h_2$ with $q_1 / c$ and $c h_1$ in equation (B.7), we have

$$\left( \frac{1}{c^2} - 1 \right) q_1(Z)^2 + (c^2 - 1) h_1(Y_0)^2 = 2 \{-\log c + \log s(Y_0)\}.$$

Since the right hand side does not vary with $Z$, we must have $c = 1$, and thus $q_1(Z) = q_2(Z)$ and $h_2(Y_0) = h_1(Y_0)$.


APPENDIX C: ADDITIONAL RESULTS MENTIONED IN 

THIS PAPER 

Regression estimator using any link function $\Lambda$

Let $\delta(Z, C) = \Lambda\{E(Y_0 \mid A = 0, Z, C)\}$ and $\alpha(A, Z, C) = \Lambda\{E(Y_0 \mid A, Z, C)\} - \Lambda\{E(Y_0 \mid A = 0, Z, C)\}$, then $E(Y_0 \mid A = 1, Z, C) = \Lambda^{-1}\{\alpha(1, Z, C) + \delta(Z, C)\}$. Let $\delta(Z, C; \xi)$ denote a parametric model for $\delta(Z, C)$ and let $\hat\xi$ denote the restricted MLE of $\xi$ using only data among the unexposed. Although in the main text $\eta$ is used to denote the parameter in the selection bias function, here we use it to denote the parameters in $\alpha(A, Z, C)$. We obtain an estimator for $\eta$ by solving:


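(C.1)
$$\mathbb{P}_n \left[ \{\omega(Z, C) - E(\omega(Z, C) \mid C; \hat\rho)\} \left\{ A \Lambda^{-1}\{\alpha(1, Z, C; \eta) + \delta(Z, C; \hat\xi)\} + (1 - A) Y \right\} \right] = 0.$$

We have the following proposition for the outcome regression estimator with any link function $\Lambda$; the proof is similar to that of Proposition 3.

Proposition 5. Under (IV.1)-(IV.2) and condition 1, suppose that $\alpha(A, Z, C; \eta)$, $f_{Z \mid C}(z \mid c; \rho)$ and $\delta(Z, C; \xi)$ are correctly specified, then the outcome regression estimator

$$\hat\psi^{reg} = \frac{\mathbb{P}_n \left[ A \Lambda^{-1}\{\alpha(1, Z, C; \hat\eta) + \delta(Z, C; \hat\xi)\} \right]}{\widehat{\Pr}(A = 1)}$$

is consistent for $\psi$.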


Equation (5.2) contains the correct model for $E(Y \mid A = 0, Z, C)$

Write $\operatorname{odds}(a) = \Pr(Y_0 = 1 \mid A = a, Z, C) / \Pr(Y_0 = 0 \mid A = a, Z, C)$. Since

$$\begin{aligned}
\operatorname{logit} \Pr(Y_0 = 1 \mid A, Z, C) &= \log \frac{\Pr(Y_0 = 1 \mid A, Z, C)}{\Pr(Y_0 = 0 \mid A, Z, C)} \\
&= \log \left\{ \frac{\Pr(Y_0 = 1 \mid A, Z, C)}{\Pr(Y_0 = 0 \mid A, Z, C)} \Big/ \frac{\Pr(Y_0 = 1 \mid A = 0, Z, C)}{\Pr(Y_0 = 0 \mid A = 0, Z, C)} \right\} \\
&\quad - \log \left\{ \frac{\Pr(Y_0 = 1 \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C)} \Big/ \frac{\Pr(Y_0 = 1 \mid A = 0, Z, C)}{\Pr(Y_0 = 0 \mid A = 0, Z, C)} \right\} + \log \frac{\Pr(Y_0 = 1 \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C)}
\end{aligned}$$

and

$$\begin{aligned}
\sum_a \frac{\Pr(Y_0 = 1 \mid A = a, Z, C)}{\Pr(Y_0 = 0 \mid A = a, Z, C)} \Pr(A = a \mid Y_0 = 0, Z, C)
&= \sum_a \frac{\Pr(Y_0 = 1, A = a \mid Z, C)}{\Pr(Y_0 = 0, A = a \mid Z, C)} \Pr(A = a \mid Y_0 = 0, Z, C) \\
&= \sum_a \frac{\Pr(Y_0 = 1, A = a \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C) \Pr(A = a \mid Y_0 = 0, Z, C)} \Pr(A = a \mid Y_0 = 0, Z, C) \\
&= \sum_a \frac{\Pr(Y_0 = 1, A = a \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C)} \\
&= \frac{\Pr(Y_0 = 1 \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C)},
\end{aligned}$$

we have

$$\begin{aligned}
\operatorname{logit} \Pr(Y_0 = 1 \mid A, Z, C)
&= \log \left\{ \frac{\operatorname{odds}(A)}{\operatorname{odds}(0)} \right\} - \log \left\{ \sum_a \frac{\operatorname{odds}(a)}{\operatorname{odds}(0)} \Pr(A = a \mid Y_0 = 0, Z, C) \right\} + \log \frac{\Pr(Y_0 = 1 \mid Z, C)}{\Pr(Y_0 = 0 \mid Z, C)} \\
&= \alpha(1, Z, C) A - \log \left[ \exp\{\alpha(1, Z, C)\} \Pr(A = 1 \mid Y_0 = 0, Z, C) + \Pr(A = 0 \mid Y_0 = 0, Z, C) \right] + \operatorname{logit} \Pr(Y_0 = 1 \mid C).
\end{aligned}$$

In our simulation $\alpha(1, Z, C) = \eta$ and $\Pr(A \mid Y_0, Z, C) = \Pr(A \mid Y_0, Z, C_1)$, thus

$$\operatorname{logit} \Pr(Y_0 = 1 \mid A, Z, C) = \alpha(1, Z, C) A - g(Z, C_1) + \operatorname{logit} \Pr(Y_0 = 1 \mid C_1, C_2),$$

where $g(Z, C_1) = \log[\exp\{\alpha(1, Z, C)\} \Pr(A = 1 \mid Y_0 = 0, Z, C) + \Pr(A = 0 \mid Y_0 = 0, Z, C)]$. Since $\operatorname{logit} \Pr(Y_0 = 1 \mid C)$ is linear in $C_2$, so is $\operatorname{logit} \Pr(Y_0 = 1 \mid A, Z, C)$.
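The closing decomposition of $\operatorname{logit} \Pr(Y_0 = 1 \mid A, Z, C)$ can be verified numerically at a fixed value of $(Z, C)$. In the Python sketch below, the marginal probability `p1`, the baseline term `b` and the log odds ratio `eta` are hypothetical values chosen only for illustration; the check compares Bayes' rule with the closed-form expression.

```python
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

p1 = 0.35            # hypothetical Pr(Y0 = 1 | C) at the fixed covariate value
b, eta = -0.4, 0.8   # hypothetical baseline term and selection bias alpha(1, Z, C)

# Extended propensity score: logit Pr(A = 1 | Y0 = y, Z, C) = b + eta * y
pr_a1 = {0: expit(b), 1: expit(b + eta)}

def direct_logit(a):
    """logit Pr(Y0 = 1 | A = a, Z, C) computed by Bayes' rule."""
    num = p1 * (pr_a1[1] if a == 1 else 1.0 - pr_a1[1])
    den = (1.0 - p1) * (pr_a1[0] if a == 1 else 1.0 - pr_a1[0])
    return math.log(num / den)

def formula_logit(a):
    """alpha * a - log{exp(alpha) Pr(A=1|Y0=0) + Pr(A=0|Y0=0)} + logit Pr(Y0=1|C)."""
    correction = math.log(math.exp(eta) * pr_a1[0] + (1.0 - pr_a1[0]))
    return eta * a - correction + logit(p1)

gaps = [abs(direct_logit(a) - formula_logit(a)) for a in (0, 1)]
```

Both entries of `gaps` vanish up to floating-point error, so at $A = 0$ the outcome model differs from $\operatorname{logit} \Pr(Y_0 = 1 \mid C)$ only through the correction term, which here depends on $(Z, C_1)$ alone.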


APPENDIX D: LOCAL EFFICIENCY 

Let $(A, L) = (A, Y_0, Z, C)$ and $O = (A, Y, Z, C)$ denote the full data and observed data respectively. Assume $\operatorname{logit} \pi(Y_0, Z, C) = \alpha(Y_0, Z, C) + \beta(Z, C)$, where $\beta(Z, C)$ is unrestricted, $\alpha(Y_0, Z, C)$ is known, and assume $Y_0 \perp Z \mid C$. First, we derive the observed data orthogonal tangent space. All the scores of $f(Y_0, Z, C)$ can be written as

$$\mathcal{M}_1 = \{S(Y_0, Z, C) : S(Y_0, Z, C) = S(Y_0 \mid C) + S(Z \mid C) + S(C)\},$$

where $E\{S(Y_0 \mid C) \mid C\} = E\{S(Z \mid C) \mid C\} = E\{S(C)\} = 0$. Therefore, by Bickel et al. (1998) and Tchetgen Tchetgen, Robins and Rotnitzky (2010), we can show that

$$\mathcal{M}_1^{\perp} = \{v(Y_0, Z, C) - v^{\dagger}(Y_0, Z, C) : \text{for any } v(Y_0, Z, C)\},$$

where $v^{\dagger}(Y_0, Z, C; \phi) = E(v \mid Z, C; \phi) + E(v \mid Y_0, C; \phi) - E(v \mid C; \phi)$. Let $\mathcal{M}_2$ denote the tangent space of all the treatment propensity scores. Thus, we have

$$\mathcal{M}_2 = \{\{A - \pi(L)\} u(Z, C) : \text{for all } u\}.$$

Therefore, the tangent space for the full data $(A, L)$ is $\mathcal{M} = \mathcal{M}_1 \oplus \mathcal{M}_2$, where $\oplus$ denotes direct summation of spaces. Rotnitzky and Robins (1997) showed that the observed data tangent space is given by $\mathcal{M}^{O} = \overline{\mathcal{M}_1^{O} + \mathcal{M}_2^{O}}$, where $\mathcal{M}_j^{O} = \overline{R(g \circ \Pi_j)}$, $R(\cdot)$ is the range of an operator, $g : \Omega^{(A, L)} \to \Omega^{(O)}$ is the conditional expectation operator $g(\cdot) = E[\cdot \mid O]$, and $\Omega^{(A, L)}$ and $\Omega^{(O)}$ are the spaces of all random functions of $(A, L)$ and $O$ respectively. $\Pi_j$ is the Hilbert space projection operator from $\Omega^{(A, L)}$ onto $\mathcal{M}_j$ and $\overline{S}$ is the closed linear span of the set $S$.

As shown in Bickel et al. (1998), the orthocomplement to the tangent space in the observed data model is $\mathcal{M}^{O, \perp} = \mathcal{M}_1^{O, \perp} \cap \mathcal{M}_2^{O, \perp}$. Rotnitzky and Robins (1997) established that

$$\mathcal{M}_1^{O, \perp} = \left\{ \frac{1 - A}{1 - \pi(L)} m(L) + N_{car} : m(L) \in \mathcal{M}_1^{\perp} \text{ and } N_{car} \in \mathcal{M}_{car} \right\},$$

where

$$\mathcal{M}_{car} = \left\{ A k(O) - \frac{1 - A}{1 - \pi(L)} E[A k(O) \mid L] : \text{for any } k(O) \in \Omega^{(O)} \right\}.$$

Therefore, by the formula for $\mathcal{M}_1^{\perp}$, we have that $\mathcal{M}_1^{O, \perp}$ consists of functions

$$\frac{1 - A}{1 - \pi(L)} \{v - v^{\dagger}\} + A k(O) - \frac{1 - A}{1 - \pi(L)} E[A k(O) \mid L].$$

Also, Rotnitzky and Robins (1997) establish that $\mathcal{M}_2^{O} = \{b(O) : b(O) \in \mathcal{M}_2\}$. Therefore, $\mathcal{M}^{O, \perp} = \{N_1^{O, \perp} \in \mathcal{M}_1^{O, \perp} : E[N_2 N_1^{O, \perp}] = 0, \ \forall N_2 \in \mathcal{M}_2\}$. Thus, we have the following result.

Lemma 1. We have

$$\mathcal{M}^{O, \perp} = \left\{ \frac{1 - A}{1 - \pi(L)} \{v - v^{\dagger}\} + A k_v(O) - \frac{1 - A}{1 - \pi(L)} E[A k_v(O) \mid L] : k_v = E[v - v^{\dagger} \mid A = 1, Z, C] \right\}.$$
PROOF. $N_1^{O,\perp}(k_v)$ is clearly in $\Lambda_1^{O,\perp}$, so it suffices to show that $N_1^{O,\perp}(k_v)$ is the unique solution to the equation $E[N_1^{O,\perp}(k^*) N_2] = 0$ for all $v$ and for all $N_2 \in \Lambda_2$. In this vein,

\[
0 = E[N_1^{O,\perp}(k^*) N_2]
  = E\left[ \left\{ \frac{(1-A)\{v - v^{\dagger}\}}{1-\pi(L)} + A k^*(O) - \frac{(1-A)\, E[A k^*(O) \mid L]}{1-\pi(L)} \right\} (A - \pi(L))\, u(Z, C) \right]
\]

for all $u$. This is equivalent to

\[
E\left[ \left\{ \frac{(1-A)\{v - v^{\dagger}\}}{1-\pi(L)} + A k^*(O) - \frac{(1-A)\, E[A k^*(O) \mid L]}{1-\pi(L)} \right\} (A - \pi(L)) \,\Big|\, Z, C \right] = 0
\]

\[
\Leftrightarrow\ E[\pi(L)\{v - v^{\dagger}\} \mid Z, C] - E[(1-\pi(L))\, A k^*(O) \mid Z, C] - E[\pi(L)\, E[A k^*(O) \mid L] \mid Z, C] = 0
\]

\[
\Leftrightarrow\ E[\pi(L)\{v - v^{\dagger}\} \mid Z, C] - E[A k^*(O) \mid Z, C] = 0
\]

\[
\Leftrightarrow\ E\left[ \left\{ E[\{v - v^{\dagger}\} \mid C, A = 1, Z] - k^*(O) \right\} A \,\Big|\, Z, C \right] = 0.
\]

Upon writing $k^*(O) = k_1^*(L)(1-A) + k_2^*(Z, C) A$, we have that $k_2^*(Z, C) = E[\{v - v^{\dagger}\} \mid C, A = 1, Z] = k_v$, proving the result of Lemma 1. □
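
The orthogonality used in this proof can be checked numerically on a toy law. The sketch below is only an illustration under stated assumptions: binary $Y_0$ and $Z$ that are independent, no covariates $C$, a hypothetical selection model logit $\pi = \alpha Y_0 + \beta(Z)$ with made-up values, and $v = Y_0 Z$ so that $v - v^{\dagger} = \{Y_0 - E(Y_0)\}\{Z - E(Z)\}$. All names (`pi`, `k_v`, `N`, `expect`) are introduced here for illustration. Expectations are computed exactly by enumerating the eight cells of $(Y_0, Z, A)$.

```python
import itertools
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical toy law: binary Y0 and Z, independent, no covariates C.
p_y1, p_z1 = 0.4, 0.3                  # Pr(Y0 = 1), Pr(Z = 1)
alpha, beta = 0.8, {0: -0.5, 1: 0.6}   # logit pi = alpha * Y0 + beta(Z)

def pi(y, z):
    # Pr(A = 1 | Y0 = y, Z = z)
    return expit(alpha * y + beta[z])

def p_full(y, z):
    # Joint probability of (Y0, Z) under independence
    return (p_y1 if y else 1 - p_y1) * (p_z1 if z else 1 - p_z1)

def resid(y, z):
    # v - v_dagger for v = Y0 * Z when Y0 is independent of Z
    return (y - p_y1) * (z - p_z1)

def k_v(z):
    # E[v - v_dagger | A = 1, Z = z]
    num = sum(p_full(y, z) * pi(y, z) * resid(y, z) for y in (0, 1))
    den = sum(p_full(y, z) * pi(y, z) for y in (0, 1))
    return num / den

def N(y, z, a):
    # Element of the orthocomplement with k = k_v, written in its collapsed form
    w = 1.0 - pi(y, z)
    return (1 - a) / w * resid(y, z) + (a - pi(y, z)) / w * k_v(z)

def expect(f):
    # Exact expectation over the eight cells of (Y0, Z, A)
    total = 0.0
    for y, z, a in itertools.product((0, 1), repeat=3):
        pa = pi(y, z) if a else 1 - pi(y, z)
        total += p_full(y, z) * pa * f(y, z, a)
    return total

mean_N = expect(N)
# Inner products with the scores (A - pi) * 1{Z = z0}, which span all u(Z) here
score_products = [
    expect(lambda y, z, a, z0=z0: N(y, z, a) * (a - pi(y, z)) * (z == z0))
    for z0 in (0, 1)
]
```

Both `mean_N` and every entry of `score_products` should vanish up to floating-point error, which is the defining property of $k_v$ in Lemma 1.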


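Note that simple algebra yields that $\Lambda^{O,\perp} = J^{\perp}$, where

\[
J^{\perp} = \left\{ \frac{1-A}{1-\pi}(v - v^{\dagger}) + \frac{A-\pi}{1-\pi}\, E(v - v^{\dagger} \mid A = 1, Z, C; \gamma, \xi) : \text{any function } v = v(Y_0, Z, C) \right\}
\]
\[
\phantom{J^{\perp}} = \left\{ Q_{v - v^{\dagger}}(Y, A, Z, C; \gamma, \xi) : \text{any function } v = v(Y, Z, C) \right\}, \qquad \text{(D.1)}
\]

where $\pi = \pi(Y, Z, C)$.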
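
The "simple algebra" can be spelled out as follows. This is a sketch that assumes, as in Lemma 1, that $k_v = E[v - v^{\dagger} \mid A = 1, Z, C]$ is a function of $(Z, C)$ only, so that $E[A k_v \mid L] = \pi(L) k_v$:

```latex
\begin{align*}
A k_v - \frac{1-A}{1-\pi}\, E[A k_v \mid L]
  &= A k_v - \frac{(1-A)\,\pi}{1-\pi}\, k_v \\
  &= \frac{A(1-\pi) - (1-A)\,\pi}{1-\pi}\, k_v \\
  &= \frac{A-\pi}{1-\pi}\, k_v .
\end{align*}
```

Adding the term $\frac{1-A}{1-\pi}(v - v^{\dagger})$ to both sides recovers the generic element of Lemma 1 in the two-term form used in (D.1).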

Heretofore, we have derived the orthogonal tangent space assuming the selection bias $\alpha(Y_0, Z, C)$ is known and $Y_0 \perp\!\!\!\perp Z \mid C$. We next show that, assuming the selection bias function $\alpha(Y_0, Z, C)$ is correctly specified and $Y_0 \perp\!\!\!\perp Z \mid C$, the space of influence functions for all RAL estimators for the parameter of interest $\psi$ is as follows:

\[
J_{\psi}^{\perp} = \left\{ Q_{d + v - v^{\dagger}}(Y, A, Z, C; \gamma, \xi) - E[\nabla_{\eta} Q_{d + v - v^{\dagger}}(Y, A, Z, C; \gamma, \xi)]\, IF_{\eta} : \text{any function } v = v(Y, Z, C) \right\}, \qquad \text{(D.2)}
\]
where $d = d(A, Y_0; \psi)$ is any function proportional to $A Y_0 / \Pr(A = 1) - \psi$, $IF_{\eta}$ is the influence function of any RAL estimator $\hat{\eta}$ for $\eta$, and $IF_{\eta}$ belongs to $J^{\perp}$. Note that the estimators given in Proposition 4 are a subset of all RAL estimators for $\psi$, so their influence functions also belong to (D.2). More specifically, the influence functions of all estimators for $\psi$ given in Proposition 4 are:

\[
Q_d(Y, A, Z, C; \gamma, \xi) - E[\nabla_{\eta} Q_d(Y, A, Z, C; \gamma, \xi)]\, IF_{\hat{\eta}},
\]

where $\hat{\eta}$ is the DR estimator of $\eta$.

To derive the efficiency bound for all regular and asymptotically linear estimators for $\psi$, let $\Pi(\cdot \mid J^{\perp})$ denote the projection onto the Hilbert space $J^{\perp}$. We have the following result for the efficiency.

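Proposition 6. Under (IV.1)-(IV.2) and condition 1, we have that $\hat{\psi}^{eff} = Q_g(Y, A, Z, C; \gamma^{\dagger}, \xi)/\Pr(A = 1) - M$ is the efficient RAL estimator for $\psi$ with influence function

\[
\frac{Q_g(Y, A, Z, C; \gamma, \xi)}{\Pr(A = 1)} - M - \psi + E\left[ \nabla_{\eta} \left( \frac{Q_g(Y, A, Z, C; \gamma, \xi)}{\Pr(A = 1)} - M \right) \right] IF_{\eta^{\dagger}},
\]

where $\gamma^{\dagger} = (\eta^{\dagger}, \theta)$, $\eta^{\dagger}$ is the most efficient estimator for $\eta$ and $M = \Pi\{ Q_g(Y, A, Z, C; \gamma^{\dagger}, \xi)/\Pr(A = 1) \mid J^{\perp} \}$.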


Proof. Recall logit $\pi(Y_0, Z, C) = \alpha(Y_0, Z, C) + \beta(Z, C)$, where $\beta(Z, C)$ is unrestricted and $\alpha(0, Z, C) = 0$. To derive the efficient influence function for $\psi$, we consider the following three model spaces:

(i) The selection bias $\alpha(Y_0, Z, C)$ is known.

(ii) The selection bias $\alpha(Y_0, Z, C)$ is known and $Y_0$ is independent of $Z$ conditional on $C$, i.e., $Y_0 \perp\!\!\!\perp Z \mid C$ holds.

(iii) The selection bias $\alpha(Y_0, Z, C)$ is parametrically specified as $\alpha(Y_0, Z, C; \gamma)$ and $Y_0$ is independent of $Z$ conditional on $C$, i.e., $Y_0 \perp\!\!\!\perp Z \mid C$, where $\gamma$ is an unknown p-dimensional parameter.

Note that the tangent space of (i) is the entire Hilbert space $H$ since there is no additional restriction on the joint data likelihood. Robins, Rotnitzky and Scharfstein (2000) showed that the efficient influence function for $\psi$ is $IF^{eff}_{\psi,1} = Q_g(Y, A, Z, C; \gamma, \xi)/\Pr(A = 1) - \psi$ where $g(Y, C) = Y$.

For (ii), we have shown that the observed data orthogonal tangent space is $J^{\perp}$ as given in (D.1). Hence, the influence function for $\psi$ in (ii) is $IF_{\psi,2} = IF^{eff}_{\psi,1} + U(t)$ where $U(t) \in J^{\perp}$ and $t$ is the parameter of a parametric submodel.
For (iii), we first characterize the space of all influence functions for estimators for $\eta$, denoted as $IF_{\eta}$. Note that $\eta$ is variation independent of all the other nuisance parameters in (iii), i.e., $\eta$ is variation independent of all the parameters in (ii). Thus, the space of the influence functions of all RAL estimators for $\eta$ in (iii) is also $J^{\perp}$. To derive the influence function for $\psi$ in (iii), note that $E_t\{IF_{\psi,2}(\psi(t), \eta(t), t)\} = 0$, thus $\nabla_t E_t\{IF_{\psi,2}(\psi(t), \eta(t), t)\} = 0$, where $\nabla$ denotes the gradient operator. Also,


\[
\nabla_t E_t\{IF_{\psi,2}(\psi(t), \eta(t), t)\}
\]
\[
= E_t\{IF_{\psi,2}(\psi(t), \eta(t), t)\, S_t(A, Y, Z, C)\} + E_t\{\nabla_{\psi} IF_{\psi,2}(\psi(t), \eta(t), t)\}\, \nabla_t \psi(t)
\]
\[
\quad + E_t\{\nabla_{\eta} IF_{\psi,2}(\psi(t), \eta(t), t)\}\, \nabla_t \eta(t) + E_t\{\nabla_t IF_{\psi,2}(\psi(t), \eta(t), t)\}.
\]

Note that

\[
IF_{\psi,2}(\psi(t), \eta(t), t) = IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) + U(t)
= Q_g(Y, A, Z, C; t)/\Pr(A = 1; t) - \psi + U(t),
\]

thus $\nabla_{\psi} IF_{\psi,2}(\psi(t), \eta(t), t) = -1$. Due to the robustness of $IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) + \psi$ for $E(Y_0 \mid A = 1, Z, C)$, we have $E\{\nabla_t IF^{eff}_{\psi,1}(\psi(t), \eta(t), t)\} = 0$. Similarly, the double robustness of $U(t)$ indicates that $E\{\nabla_t U(t)\} = 0$. Thus, $E\{\nabla_t IF_{\psi,2}(\psi(t), \eta(t), t)\} = 0$.

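Since

\[
\frac{d\psi}{dt} = E_t\{IF_{\psi,2}(\psi(t), \eta(t), t)\, S_t(A, Y, Z, C)\} + E_t\{\nabla_{\eta} IF_{\psi,2}(\psi(t), \eta(t), t)\}\, \nabla_t \eta(t)
\]
\[
\phantom{\frac{d\psi}{dt}} = E_t\left[ \left\{ IF_{\psi,2}(\psi(t), \eta(t), t) + E_t\{\nabla_{\eta} IF_{\psi,2}(\psi(t), \eta(t), t)\}\, IF_{\eta} \right\} S_t(A, Y, Z, C) \right],
\]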
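
The step from the chain-rule expansion to this expression for $d\psi/dt$ can be made explicit. The sketch below writes $IF_{\psi,2}$ for the influence function of $\psi$ in model (ii) and suppresses its arguments; it assumes the usual pathwise-derivative property of an influence function, $\nabla_t \eta(t) = E_t\{IF_{\eta} S_t\}$, which is not stated explicitly in the text above.

```latex
\begin{align*}
0 &= E_t\{IF_{\psi,2}\, S_t\}
     + \underbrace{E_t\{\nabla_{\psi} IF_{\psi,2}\}}_{=\,-1} \nabla_t \psi(t)
     + E_t\{\nabla_{\eta} IF_{\psi,2}\}\, \nabla_t \eta(t)
     + \underbrace{E_t\{\nabla_t IF_{\psi,2}\}}_{=\,0}, \\
\nabla_t \psi(t)
  &= E_t\{IF_{\psi,2}\, S_t\}
     + E_t\{\nabla_{\eta} IF_{\psi,2}\}\, E_t\{IF_{\eta}\, S_t\} \\
  &= E_t\!\left[ \left\{ IF_{\psi,2}
     + E_t\{\nabla_{\eta} IF_{\psi,2}\}\, IF_{\eta} \right\} S_t \right].
\end{align*}
```

The last line identifies the term in braces as an influence function for $\psi$, because its covariance with the submodel score reproduces the pathwise derivative of $\psi$.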


Thus,

\[
\left\{ IF_{\psi,2}(\psi(t), \eta(t), t) + E_t\{\nabla_{\eta} IF_{\psi,2}(\psi(t), \eta(t), t)\}\, IF_{\eta} \right\}
\]

is the space of influence functions for all RAL estimators for $\psi$, which is also the observed data orthocomplement to the nuisance tangent space for model (iii). Note that $IF_{\psi,2}(\psi(t), \eta(t), t) = IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) + U(t)$, thus by choosing $U(t) = -\Pi\{IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) \mid J^{\perp}\} \in J^{\perp}$ and $IF_{\eta} = IF^{eff}_{\eta}$, the influence function for an RAL estimator for $\psi$ is $\Pi(IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) \mid J) + E[\nabla_{\eta}\{\Pi(IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) \mid J)\}]\, IF^{eff}_{\eta}$. Note that this influence function is in the tangent space of the model, thus it is the efficient influence function for $\psi$ in model (iii). That is,

\[
IF^{eff}_{\psi} = \Pi(IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) \mid J) + E[\nabla_{\eta}\{\Pi(IF^{eff}_{\psi,1}(\psi(t), \eta(t), t) \mid J)\}]\, IF^{eff}_{\eta}.
\]

□

Proposition 6 provides a theoretical efficiency bound for all regular estimators of the ETT. Finding the optimal function $v$ such that $\hat{\eta}$ is the most efficient estimator might be challenging in practice. For illustration, we derive the efficient influence function for $\eta$ and $\psi$ where $Y$ and $Z$ are both binary. Thus, $v(Y_0, Z, C)$ can be written as $v(Y_0, Z, C) = h_0(C) + h_1(C) Z + h_2(C) Y_0 + h(C) Y_0 Z$ and thus $v - v^{\dagger} = h(C)\, \tilde{v}(Y_0, Z, C)$ where $\tilde{v}(Y_0, Z, C) = \{Y_0 - E(Y_0 \mid C)\}\{Z - E(Z \mid C)\}$. Let $U_h(t) = Q_{v - v^{\dagger}}(Y, A, Z, C; \gamma, \xi)$, thus $U_h(t) = h(C) \Delta(t)$, where

\[
\Delta(t) = Q_{\tilde{v}}(Y, A, Z, C; \gamma, \xi).
\]

To find the efficient influence function of $\hat{\psi}^{DR}$, we first find the efficient influence function of $\eta$. Let $h^{opt}$ denote the choice of $h$ such that $\hat{\eta}$ is the most efficient. We have that $U_h^{opt}$ satisfies $E[\partial U_h / \partial \eta^{T}] = -E[U_h U_h^{opt\,T}]$ for any $h$ (Newey and McFadden 1994, Chap 36). Thus, the efficient estimator for $\eta$ satisfies $\mathbb{P}_n[U_h^{opt}] = 0$. That is,

\[
E\left[ h(C) \left\{ \frac{\partial \Delta(t)}{\partial \eta^{T}} + \Delta(t) \Delta(t)^{T} h^{opt\,T}(C) \right\} \right] = 0.
\]

Select $h(C) = E[\partial \Delta(t)/\partial \eta^{T} + \Delta(t) \Delta(t)^{T} h^{opt\,T}(C) \mid C]$, thus

\[
h^{opt}(C) = -E[\Delta(t) \Delta(t)^{T} \mid C]^{-1} E[\partial \Delta(t)/\partial \eta^{T} \mid C].
\]

Thus, the efficient influence function for $\eta$ is $IF^{eff}_{\eta} = h^{opt}(C) \Delta(t)$. Let $H = Q_g(Y, A, Z, C; \gamma^{\dagger}, \xi)/\Pr(A = 1)$, thus $M = \Pi\{H \mid J^{\perp}\}$. Also note that $M \in J^{\perp}$, thus $M$ can be written as $M = h(C) \Delta(t)$. Hence, we have $E[\{H - h(C) \Delta(t)\}\, \tilde{h}(C) \Delta(t)] = 0$ for any $\tilde{h}(C)$. This is equivalent to $E[\{H - h(C) \Delta(t)\} \Delta(t) \mid C] = 0$. Thus, we obtain $h(C) = E\{\Delta^{2}(t) \mid C\}^{-1} E\{H \Delta(t) \mid C\}$, which yields $M = E\{\Delta^{2}(t) \mid C\}^{-1} E\{H \Delta(t) \mid C\} \Delta(t)$, and by Proposition 6 we obtain the efficient influence function for $\psi$.
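
The closed-form projection above, with coefficient $E\{(\cdot)^2 \mid C\}^{-1} E\{H \cdot \mid C\}$ applied to the scalar generator of $J^{\perp}$ in the binary case, has a simple sample analogue when $C$ is discrete. The sketch below is a minimal illustration, not the authors' implementation: it assumes the values of $H$ and of the generator (called `Delta` here) have already been evaluated for each observation, that the generator is scalar, and that every stratum of $C$ has a nonzero sum of squares. The function name `project_onto_span` and the toy numbers are hypothetical.

```python
from collections import defaultdict

def project_onto_span(H, Delta, C):
    """Sample analogue of M = E[Delta^2 | C]^{-1} E[H * Delta | C] * Delta
    for a discrete covariate C (stratum-wise least squares through the origin)."""
    num = defaultdict(float)   # stratum sums of H * Delta
    den = defaultdict(float)   # stratum sums of Delta^2
    for h_i, d_i, c_i in zip(H, Delta, C):
        num[c_i] += h_i * d_i
        den[c_i] += d_i * d_i
    # Assumes den[c] > 0 in every observed stratum.
    coef = {c: num[c] / den[c] for c in den}
    M = [coef[c_i] * d_i for d_i, c_i in zip(Delta, C)]
    return coef, M

# Toy inputs (made up for illustration only).
H = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
Delta = [0.5, -1.0, 2.0, 1.0, -0.5, 1.5]
C = [0, 0, 0, 1, 1, 1]

coef, M = project_onto_span(H, Delta, C)

# Residual H - M should be orthogonal to Delta within each stratum of C.
residual_products = defaultdict(float)
for h_i, m_i, d_i, c_i in zip(H, M, Delta, C):
    residual_products[c_i] += (h_i - m_i) * d_i
```

Within each stratum the residual $H - M$ is orthogonal to the generator, which mirrors the condition $E[\{H - h(C)(\cdot)\}(\cdot) \mid C] = 0$ used to solve for $h(C)$. With continuous $C$, the stratum sums would have to be replaced by some regression estimate of the two conditional expectations.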